Nobody98¶

Taxi and Limousine Commission (TLC) Dataset Analysis¶

Exploratory Data Analysis, Regression, Machine Learning¶

Python: Pandas, scikit-learn, Yellowbrick, XGBoost, SciPy, Seaborn.¶

Data: https://data.cityofnewyork.us/Transportation/2017-Yellow-Taxi-Trip-Data/biws-g3hs/data¶

May '23¶

Importing packages, dataset¶

In [1]:
import pandas as pd
import numpy as np

from matplotlib import pyplot as plt
import seaborn as sb
import plotly.express as px
from ydata_profiling import ProfileReport

import math
from scipy import stats

from statsmodels.formula.api import ols
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, mean_absolute_error,r2_score,mean_squared_error
from yellowbrick.classifier import ClassificationReport, ClassPredictionError

from statsmodels.tools.tools import pinv_extended
from statsmodels.stats.multicomp import pairwise_tukeyhsd
import statsmodels.api as sm

print("Hello World!!")
Hello World!!
In [80]:
data = pd.read_csv('C:/Users/Jason/Desktop/Data analysis/TLC Analysis/gvJoe37aS_6HZqnLUmiajw_592fe280dd804403b6e33fd9d9ffa9f1_2017_Yellow_Taxi_Trip_Data.csv')

data.head()
Out[80]:
Unnamed: 0 VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count trip_distance RatecodeID store_and_fwd_flag PULocationID DOLocationID payment_type fare_amount extra mta_tax tip_amount tolls_amount improvement_surcharge total_amount
0 24870114 2 03/25/2017 8:55:43 AM 03/25/2017 9:09:47 AM 6 3.34 1 N 100 231 1 13.00 0.00 0.50 2.76 0.00 0.30 16.56
1 35634249 1 04/11/2017 2:53:28 PM 04/11/2017 3:19:58 PM 1 1.80 1 N 186 43 1 16.00 0.00 0.50 4.00 0.00 0.30 20.80
2 106203690 1 12/15/2017 7:26:56 AM 12/15/2017 7:34:08 AM 1 1.00 1 N 262 236 1 6.50 0.00 0.50 1.45 0.00 0.30 8.75
3 38942136 2 05/07/2017 1:17:59 PM 05/07/2017 1:48:14 PM 1 3.70 1 N 188 97 1 20.50 0.00 0.50 6.39 0.00 0.30 27.69
4 30841670 2 04/15/2017 11:32:20 PM 04/15/2017 11:49:03 PM 1 4.37 1 N 4 112 2 16.50 0.50 0.50 0.00 0.00 0.30 17.80
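The head above shows that the `tpep_*` timestamps arrive as strings in the US-style `MM/DD/YYYY h:MM:SS AM/PM` layout. As a minimal sketch (using a tiny hand-made sample, not the actual file), `pd.to_datetime` with an explicit `format=` parses this layout directly:

```python
import pandas as pd

# Tiny sample mimicking the tpep_* string format seen in data.head()
sample = pd.Series(["03/25/2017 8:55:43 AM", "04/11/2017 2:53:28 PM"])

# %I is the 12-hour clock, %p the AM/PM marker; an explicit format
# avoids per-row inference and is faster on large files
parsed = pd.to_datetime(sample, format="%m/%d/%Y %I:%M:%S %p")
print(parsed.dt.hour.tolist())  # 8 AM -> 8, 2 PM -> 14
```

The same `format=` string could also be passed via `parse_dates` plus a converter at read time, though the notebook parses after loading, which works just as well.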

Exploratory Data Analysis¶

In [6]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22699 entries, 0 to 22698
Data columns (total 18 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Unnamed: 0             22699 non-null  int64  
 1   VendorID               22699 non-null  int64  
 2   tpep_pickup_datetime   22699 non-null  object 
 3   tpep_dropoff_datetime  22699 non-null  object 
 4   passenger_count        22699 non-null  int64  
 5   trip_distance          22699 non-null  float64
 6   RatecodeID             22699 non-null  int64  
 7   store_and_fwd_flag     22699 non-null  object 
 8   PULocationID           22699 non-null  int64  
 9   DOLocationID           22699 non-null  int64  
 10  payment_type           22699 non-null  int64  
 11  fare_amount            22699 non-null  float64
 12  extra                  22699 non-null  float64
 13  mta_tax                22699 non-null  float64
 14  tip_amount             22699 non-null  float64
 15  tolls_amount           22699 non-null  float64
 16  improvement_surcharge  22699 non-null  float64
 17  total_amount           22699 non-null  float64
dtypes: float64(8), int64(7), object(3)
memory usage: 3.1+ MB
In [7]:
data.describe()
Out[7]:
Unnamed: 0 VendorID passenger_count trip_distance RatecodeID PULocationID DOLocationID payment_type fare_amount extra mta_tax tip_amount tolls_amount improvement_surcharge total_amount
count 2.269900e+04 22699.000000 22699.000000 22699.000000 22699.000000 22699.000000 22699.000000 22699.000000 22699.000000 22699.000000 22699.000000 22699.000000 22699.000000 22699.000000 22699.000000
mean 5.675849e+07 1.556236 1.642319 2.913313 1.043394 162.412353 161.527997 1.336887 13.026629 0.333275 0.497445 1.835781 0.312542 0.299551 16.310502
std 3.274493e+07 0.496838 1.285231 3.653171 0.708391 66.633373 70.139691 0.496211 13.243791 0.463097 0.039465 2.800626 1.399212 0.015673 16.097295
min 1.212700e+04 1.000000 0.000000 0.000000 1.000000 1.000000 1.000000 1.000000 -120.000000 -1.000000 -0.500000 0.000000 0.000000 -0.300000 -120.300000
25% 2.852056e+07 1.000000 1.000000 0.990000 1.000000 114.000000 112.000000 1.000000 6.500000 0.000000 0.500000 0.000000 0.000000 0.300000 8.750000
50% 5.673150e+07 2.000000 1.000000 1.610000 1.000000 162.000000 162.000000 1.000000 9.500000 0.000000 0.500000 1.350000 0.000000 0.300000 11.800000
75% 8.537452e+07 2.000000 2.000000 3.060000 1.000000 233.000000 233.000000 2.000000 14.500000 0.500000 0.500000 2.450000 0.000000 0.300000 17.800000
max 1.134863e+08 2.000000 6.000000 33.960000 99.000000 265.000000 265.000000 4.000000 999.990000 4.500000 0.500000 200.000000 19.100000 0.300000 1200.290000
In [9]:
data.columns
Out[9]:
Index(['Unnamed: 0', 'VendorID', 'tpep_pickup_datetime',
       'tpep_dropoff_datetime', 'passenger_count', 'trip_distance',
       'RatecodeID', 'store_and_fwd_flag', 'PULocationID', 'DOLocationID',
       'payment_type', 'fare_amount', 'extra', 'mta_tax', 'tip_amount',
       'tolls_amount', 'improvement_surcharge', 'total_amount'],
      dtype='object')

Top 25 observations by total amount paid

In [87]:
data[["tpep_pickup_datetime", "trip_distance", "fare_amount", "total_amount"]].sort_values(by="total_amount", ascending= False)[:25].style.background_gradient()
Out[87]:
  tpep_pickup_datetime trip_distance fare_amount total_amount
8476 2017-02-06 05:50:10 2.600000 999.990000 1200.290000
20312 2017-12-19 09:40:46 0.000000 450.000000 450.300000
13861 2017-05-19 08:20:21 33.920000 200.010000 258.210000
12511 2017-12-17 18:24:24 0.000000 175.000000 233.740000
15474 2017-06-06 20:55:01 0.000000 200.000000 211.800000
6064 2017-06-13 12:30:22 32.720000 107.000000 179.060000
16379 2017-11-30 10:41:11 25.500000 140.000000 157.060000
3582 2017-01-01 23:53:01 7.300000 152.000000 152.300000
11269 2017-06-19 00:51:17 0.000000 120.000000 151.820000
9280 2017-06-18 23:33:25 33.960000 150.000000 150.300000
1928 2017-06-16 18:30:08 12.500000 120.000000 137.800000
10291 2017-09-11 11:41:04 31.950000 131.000000 131.800000
6708 2017-10-30 11:23:46 0.320000 100.000000 126.000000
11608 2017-12-19 17:00:56 23.000000 99.500000 123.300000
908 2017-03-27 13:01:38 26.120000 100.000000 121.560000
7281 2017-01-01 03:02:53 0.000000 100.000000 120.960000
18130 2017-10-26 14:45:01 30.500000 90.500000 119.310000
13621 2017-11-04 13:32:14 19.800000 105.000000 115.940000
13359 2017-01-12 07:19:36 0.000000 75.000000 111.950000
29 2017-11-06 20:30:50 30.830000 80.000000 111.380000
18888 2017-11-04 12:22:33 17.980000 73.500000 110.160000
11839 2017-09-19 16:33:48 2.700000 99.000000 110.000000
11863 2017-08-24 19:44:41 14.100000 80.000000 108.950000
5271 2017-12-07 13:48:52 17.960000 70.000000 107.280000
5536 2017-03-16 12:14:51 17.500000 69.500000 106.600000
In [88]:
data.select_dtypes(include = 'number').sort_values(by="total_amount", ascending= False)[:25].style.background_gradient()
Out[88]:
  Unnamed: 0 VendorID passenger_count trip_distance RatecodeID PULocationID DOLocationID payment_type fare_amount extra mta_tax tip_amount tolls_amount improvement_surcharge total_amount duration_secs duration_mins
8476 11157412 1 1 2.600000 5 226 226 1 999.990000 0.000000 0.000000 200.000000 0.000000 0.300000 1200.290000 58.000000 0.966667
20312 107558404 2 2 0.000000 5 265 265 2 450.000000 0.000000 0.000000 0.000000 0.000000 0.300000 450.300000 9.000000 0.150000
13861 40523668 2 1 33.920000 5 229 265 1 200.010000 0.000000 0.500000 51.640000 5.760000 0.300000 258.210000 3609.000000 60.150000
12511 107108848 2 1 0.000000 5 265 265 1 175.000000 0.000000 0.000000 46.690000 11.750000 0.300000 233.740000 18.000000 0.300000
15474 55538852 2 1 0.000000 5 265 265 1 200.000000 0.000000 0.500000 11.000000 0.000000 0.300000 211.800000 5.000000 0.083333
6064 49894023 2 1 32.720000 3 138 1 1 107.000000 0.000000 0.000000 55.500000 16.260000 0.300000 179.060000 4049.000000 67.483333
16379 101198443 2 1 25.500000 5 132 265 2 140.000000 0.000000 0.500000 0.000000 16.260000 0.300000 157.060000 3034.000000 50.566667
3582 111653084 1 1 7.300000 5 1 1 1 152.000000 0.000000 0.000000 0.000000 0.000000 0.300000 152.300000 41.000000 0.683333
11269 51920669 1 2 0.000000 5 265 265 1 120.000000 0.000000 0.000000 20.000000 11.520000 0.300000 151.820000 55.000000 0.916667
9280 51810714 2 2 33.960000 5 132 265 2 150.000000 0.000000 0.000000 0.000000 0.000000 0.300000 150.300000 2353.000000 39.216667
1928 51087145 1 2 12.500000 5 211 265 1 120.000000 0.000000 0.000000 5.000000 12.500000 0.300000 137.800000 2922.000000 48.700000
10291 76319330 2 1 31.950000 4 138 265 2 131.000000 0.000000 0.500000 0.000000 0.000000 0.300000 131.800000 2274.000000 37.900000
6708 91660295 2 1 0.320000 5 264 83 1 100.000000 0.000000 0.500000 25.200000 0.000000 0.300000 126.000000 3.000000 0.050000
11608 107690629 2 2 23.000000 3 151 1 1 99.500000 1.000000 0.000000 10.000000 12.500000 0.300000 123.300000 6060.000000 101.000000
908 25075013 2 2 26.120000 4 138 265 1 100.000000 0.000000 0.500000 15.000000 5.760000 0.300000 121.560000 2226.000000 37.100000
7281 111091850 2 1 0.000000 5 265 265 1 100.000000 0.000000 0.500000 20.160000 0.000000 0.300000 120.960000 9.000000 0.150000
18130 90375786 1 1 30.500000 1 132 220 1 90.500000 0.000000 0.500000 19.850000 8.160000 0.300000 119.310000 5268.000000 87.800000
13621 93330154 1 2 19.800000 5 265 230 1 105.000000 0.000000 0.000000 8.000000 2.640000 0.300000 115.940000 2796.000000 46.600000
13359 3055315 1 1 0.000000 5 1 1 1 75.000000 0.000000 0.000000 18.650000 18.000000 0.300000 111.950000 20.000000 0.333333
29 94052446 2 1 30.830000 1 132 23 1 80.000000 0.500000 0.500000 18.560000 11.520000 0.300000 111.380000 12550.000000 209.166667
18888 93297612 2 6 17.980000 3 230 1 1 73.500000 0.000000 0.000000 18.360000 18.000000 0.300000 110.160000 2746.000000 45.766667
11839 78875919 1 4 2.700000 5 231 265 1 99.000000 0.000000 0.000000 10.700000 0.000000 0.300000 110.000000 1952.000000 32.533333
11863 71564944 1 1 14.100000 5 48 265 1 80.000000 0.000000 0.000000 18.150000 10.500000 0.300000 108.950000 2145.000000 35.750000
5271 103571464 2 1 17.960000 3 164 1 1 70.000000 0.000000 0.000000 17.880000 19.100000 0.300000 107.280000 2415.000000 40.250000
5536 21688416 1 1 17.500000 3 164 1 1 69.500000 0.000000 0.000000 21.300000 15.500000 0.300000 106.600000 2265.000000 37.750000
In [ ]:
%matplotlib inline

Create variables¶

Client request: It would be really helpful if you could create meaningful variables by combining or modifying the given fields, along with a summary of the data visualizations.

Create target variable - Duration

In [3]:
data.tpep_dropoff_datetime = pd.to_datetime(data.tpep_dropoff_datetime)

data.tpep_pickup_datetime = pd.to_datetime(data.tpep_pickup_datetime)

data["duration_secs"] = (data.tpep_dropoff_datetime - data.tpep_pickup_datetime)/pd.Timedelta(seconds = 1)

data["duration_mins"] = (data.tpep_dropoff_datetime - data.tpep_pickup_datetime)/pd.Timedelta(minutes = 1)
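The cell above divides a `Timedelta` by `pd.Timedelta(seconds=1)`; an equivalent route is `.dt.total_seconds()` on the difference. A minimal sketch on two synthetic rows (timestamps invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "tpep_pickup_datetime":  pd.to_datetime(["2017-03-25 08:55:43", "2017-04-11 14:53:28"]),
    "tpep_dropoff_datetime": pd.to_datetime(["2017-03-25 09:09:47", "2017-04-11 15:19:58"]),
})

# dropoff minus pickup yields a Timedelta Series;
# total_seconds() converts it to a float column in one step
delta = df["tpep_dropoff_datetime"] - df["tpep_pickup_datetime"]
df["duration_secs"] = delta.dt.total_seconds()
df["duration_mins"] = df["duration_secs"] / 60
```

Either form gives identical values; `total_seconds()` just states the unit explicitly. It is also worth asserting that no duration is negative before modelling, since a dropoff earlier than its pickup indicates a data error.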
In [4]:
data["week"] = data.tpep_pickup_datetime.dt.isocalendar().week

data["day"] = data.tpep_pickup_datetime.dt.day_name().str.slice(stop = 3)

data["month"] = data.tpep_pickup_datetime.dt.month_name().str.slice(stop = 3)

data["tpep_pickup_time"] = data["tpep_pickup_datetime"].dt.time

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

data.month = pd.Categorical(data.month, categories = months, ordered = True)
data["month_num"] = data.month.cat.codes

data.day = pd.Categorical(data.day, categories = days, ordered = True)
data["day_num"] = data.day.cat.codes

data.sample(3)
Out[4]:
Unnamed: 0 VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count trip_distance RatecodeID store_and_fwd_flag PULocationID DOLocationID ... improvement_surcharge total_amount duration_secs duration_mins week day month tpep_pickup_time month_num day_num
15046 55154879 1 2017-06-30 23:48:43 2017-06-30 23:58:49 2 1.80 1 N 164 230 ... 0.3 12.35 606.0 10.100000 26 Fri Jun 23:48:43 5 4
5433 85913300 2 2017-10-12 16:00:27 2017-10-12 16:17:57 1 1.94 1 N 186 125 ... 0.3 14.30 1050.0 17.500000 41 Thu Oct 16:00:27 9 3
8726 86820791 2 2017-10-15 09:37:14 2017-10-15 09:38:42 5 0.57 2 N 164 230 ... 0.3 58.56 88.0 1.466667 41 Sun Oct 09:37:14 9 6

3 rows × 26 columns

Divide 24 hour day into categories

In [5]:
data["hour"] = data["tpep_pickup_datetime"].dt.strftime("%H").astype(int)

conditions = [(data['hour'] >= 1) & (data['hour'] < 5),
              (data['hour'] >= 5) & (data['hour'] < 9),  (data['hour'] >= 9) & (data['hour'] < 13),
              (data['hour'] >= 13) & (data['hour'] < 17),(data['hour'] >= 17) & (data['hour'] < 21),
              (data['hour'] >= 21), (data['hour'] < 1)
    ]

# create a list of the values we want to assign for each condition
values = ['Night owls', 'Morning people', 'My people', "Afternoon rush", "Night travellers", 'Late night', "Late night"]
values_ = ["01:00 - 05:00", "05:00 - 09:00", "09:00 - 13:00", "13:00 - 17:00", "17:00 - 21:00", "21:00 - 01:00", "21:00 - 01:00"]

# create a new column and use np.select to assign values to it using our lists as arguments
data['period_of_day'] = np.select(conditions, values)
data['time_of_day'] = np.select(conditions, values_)

#del values, values_
#data.drop("hour", axis = 1, inplace = True)

#data.sample(5)
data.time_of_day.value_counts()
Out[5]:
17:00 - 21:00    5414
13:00 - 17:00    4650
09:00 - 13:00    4354
21:00 - 01:00    4237
05:00 - 09:00    2611
01:00 - 05:00    1433
Name: time_of_day, dtype: int64
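The `np.select` approach above works well; `pd.cut` is an alternative for the same right-open hour bins. A sketch on a synthetic hour Series (the wrap-around 21:00–01:00 interval is handled by giving both tail bins the same label, which requires `ordered=False`):

```python
import pandas as pd

hours = pd.Series([0, 3, 7, 11, 15, 19, 23])

# bin edges are right-inclusive: (-1, 0] catches hour 0, (0, 4] hours 1-4, etc.
bins = [-1, 0, 4, 8, 12, 16, 20, 23]
labels = ["21:00 - 01:00", "01:00 - 05:00", "05:00 - 09:00", "09:00 - 13:00",
          "13:00 - 17:00", "17:00 - 21:00", "21:00 - 01:00"]

# duplicate labels (the wrapped 21:00 - 01:00 bucket) need ordered=False
tod = pd.cut(hours, bins=bins, labels=labels, ordered=False)
```

`pd.cut` keeps the edges in one place, which makes it harder for a boundary to drift between the `conditions` and `values` lists.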
In [6]:
conditions = [(data['hour'] >= 6) & (data['hour'] < 10),
              (data['hour'] >= 10) & (data['hour'] < 16),
              (data['hour'] >= 16) & (data['hour'] < 20),
              (data['hour'] >= 20), (data['hour'] < 6) ]

values = ['Morning Rush', 'Day-lighters', 'Evening Rush', "Night owls","Night owls"]
data['period_of_day2'] = np.select(conditions, values)

data['period_of_day2'].value_counts()
Out[6]:
Night owls      7200
Day-lighters    6760
Evening Rush    5251
Morning Rush    3488
Name: period_of_day2, dtype: int64

Payment types

In [7]:
data['payment_cats'] = data.payment_type.replace({1:'Credit card',2:'Cash', 3: 'No charge', 4:'Dispute', 5:'Unknown', 6:'Voided trip'})
data['payment_cats'].value_counts()
Out[7]:
Credit card    15265
Cash            7267
No charge        121
Dispute           46
Name: payment_cats, dtype: int64
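One subtlety with `.replace` above: codes absent from the mapping pass through unchanged. `.map` instead returns `NaN` for unmapped codes, which makes unexpected values easy to spot. A sketch with a synthetic code Series (the code `7` is invented to show the behaviour):

```python
import pandas as pd

payment_map = {1: "Credit card", 2: "Cash", 3: "No charge",
               4: "Dispute", 5: "Unknown", 6: "Voided trip"}
codes = pd.Series([1, 2, 2, 4, 7])

# .map leaves unknown codes as NaN rather than passing them through;
# fillna then flags them explicitly
cats = codes.map(payment_map).fillna("Unmapped")
```

In this dataset only codes 1–4 occur, so `.replace` and `.map` agree, but `.map` is the safer default when the code list may grow.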

More time variables

In [8]:
data['minutes'] = data["tpep_pickup_datetime"].dt.strftime("%M").astype(int)
In [9]:
data["time"] = data.hour + data.minutes/60
In [10]:
data[["tpep_pickup_datetime", "hour", "minutes", "time"]].sample(5)
Out[10]:
tpep_pickup_datetime hour minutes time
16948 2017-05-29 11:28:46 11 28 11.466667
5688 2017-05-15 11:44:53 11 44 11.733333
13631 2017-01-02 19:16:30 19 16 19.266667
20014 2017-12-23 12:24:59 12 24 12.400000
1824 2017-03-22 10:13:02 10 13 10.216667
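The `strftime -> str -> int` round trip above can be skipped: datetime Series expose `.dt.hour` and `.dt.minute` directly as integers. A sketch on two synthetic timestamps:

```python
import pandas as pd

ts = pd.to_datetime(pd.Series(["2017-05-29 11:28:46", "2017-01-02 19:16:30"]))

# .dt.hour / .dt.minute give ints directly, no string formatting needed
time = ts.dt.hour + ts.dt.minute / 60
```

This produces the same fractional-hour values (e.g. 11:28 becomes about 11.47) with less code and no string parsing.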

Sidetracked¶

Time-based data visualizations

In [10]:
temp = data.drop(["Unnamed: 0", "VendorID", "RatecodeID", "PULocationID", "DOLocationID", "payment_type", "week"], 
                 axis = 1).groupby(pd.Grouper(key = "tpep_pickup_datetime", freq="4H")).sum(numeric_only = True
               ).sort_values(by = "duration_secs", ascending = False).reset_index().copy(deep = True)

temp.tpep_pickup_datetime = temp.tpep_pickup_datetime.astype('string').str.slice(start = 11)

temp
Out[10]:
tpep_pickup_datetime passenger_count trip_distance fare_amount extra mta_tax tip_amount tolls_amount improvement_surcharge total_amount duration_secs duration_mins month_num day_num hour minutes time
0 20:00:00 24 38.09 175.0 8.0 8.0 25.36 0.00 4.8 221.16 183111.0 3051.850000 80 64 338 575 347.583333
1 08:00:00 18 41.33 174.0 0.0 6.0 24.29 5.76 3.6 213.65 178483.0 2974.716667 132 24 113 346 118.766667
2 16:00:00 26 90.00 348.5 0.0 7.5 50.74 5.76 4.5 417.00 109384.0 1823.066667 45 90 261 539 269.983333
3 20:00:00 38 94.81 343.0 8.5 9.0 68.45 13.68 5.4 448.03 109372.0 1822.866667 54 54 389 493 397.216667
4 16:00:00 44 41.93 243.0 18.5 9.5 21.19 5.76 5.7 303.65 106113.0 1768.550000 76 38 335 662 346.033333
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2185 04:00:00 0 0.00 0.0 0.0 0.0 0.00 0.00 0.0 0.00 0.0 0.000000 0 0 0 0 0.000000
2186 04:00:00 0 0.00 0.0 0.0 0.0 0.00 0.00 0.0 0.00 0.0 0.000000 0 0 0 0 0.000000
2187 00:00:00 0 0.00 0.0 0.0 0.0 0.00 0.00 0.0 0.00 0.0 0.000000 0 0 0 0 0.000000
2188 00:00:00 0 0.00 0.0 0.0 0.0 0.00 0.00 0.0 0.00 0.0 0.000000 0 0 0 0 0.000000
2189 04:00:00 0 0.00 0.0 0.0 0.0 0.00 0.00 0.0 0.00 0.0 0.000000 0 0 0 0 0.000000

2190 rows × 17 columns
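The `pd.Grouper(freq="4H")` call above buckets pickups into fixed 4-hour windows anchored at midnight. A minimal sketch on three synthetic rows (lowercase `"4h"` is the alias current pandas versions prefer; uppercase `"4H"` also works but is deprecated in newer releases):

```python
import pandas as pd

df = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime(
        ["2017-01-01 00:30", "2017-01-01 01:45", "2017-01-01 05:10"]),
    "duration_secs": [600.0, 300.0, 900.0],
})

# Grouper(freq="4h") bins timestamps at 00:00, 04:00, 08:00, ...
g = (df.groupby(pd.Grouper(key="tpep_pickup_datetime", freq="4h"))
       .sum(numeric_only=True))
```

Here the first two rows fall into the 00:00 bin (summing to 900 seconds) and the third into the 04:00 bin, mirroring what the full notebook cell does at scale.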

In [40]:
# Collectively, taxi rides take the longest around 16h00; evening traffic?

sb.barplot(temp.groupby("tpep_pickup_datetime").sum(numeric_only = True).reset_index(), y = "duration_mins", x = "tpep_pickup_datetime")
Out[40]:
<AxesSubplot: xlabel='tpep_pickup_datetime', ylabel='duration_mins'>
In [43]:
# The best time to make money is around 16h00, earning over $185,
# compared to around $70 at 04h00 or $170 around noon

sb.barplot(temp.groupby("tpep_pickup_datetime").mean(numeric_only = True).reset_index(), y = "fare_amount", x = "tpep_pickup_datetime")
Out[43]:
<AxesSubplot: xlabel='tpep_pickup_datetime', ylabel='fare_amount'>

Sidetrack 2: Tabular visuals¶

Average nums for each day of the month

In [47]:
data.drop(["Unnamed: 0", "VendorID", "RatecodeID", "PULocationID", "DOLocationID", "payment_type", "week"], axis = 1).groupby(["month", "day"]).mean(numeric_only = True).reset_index().iloc[0:10,].style.background_gradient()
Out[47]:
  month day passenger_count trip_distance fare_amount extra mta_tax tip_amount tolls_amount improvement_surcharge total_amount duration_secs duration_mins month_num day_num
0 Jan Mon 1.511864 2.915085 12.216949 0.276271 0.498305 1.737322 0.424475 0.300000 15.453322 750.881356 12.514689 0.000000 0.000000
1 Jan Tue 1.702703 2.849610 12.965465 0.354354 0.495495 1.838949 0.409429 0.298198 16.373604 849.798799 14.163313 0.000000 1.000000
2 Jan Wed 1.696000 2.492000 11.782000 0.398000 0.500000 1.697320 0.187440 0.300000 14.872560 810.228000 13.503800 0.000000 2.000000
3 Jan Thu 1.663082 2.870466 12.978495 0.394265 0.496416 2.029534 0.327742 0.300000 16.544194 858.329749 14.305496 0.000000 3.000000
4 Jan Fri 1.676923 2.785577 12.648077 0.409615 0.498077 1.760577 0.274769 0.300000 15.891115 853.519231 14.225321 0.000000 4.000000
5 Jan Sat 1.793358 2.629889 11.934317 0.190037 0.498155 1.718450 0.224871 0.300000 14.865830 775.937269 12.932288 0.000000 5.000000
6 Jan Sun 1.647249 3.192621 13.839806 0.199029 0.498382 1.734466 0.340647 0.300000 16.924951 875.297735 14.588296 0.000000 6.000000
7 Feb Mon 1.682540 3.197989 17.998360 0.335979 0.494709 2.806508 0.234497 0.300000 22.196243 762.687831 12.711464 1.000000 0.000000
8 Feb Tue 1.591440 2.326965 11.313230 0.451362 0.500000 1.615759 0.160778 0.300000 14.348716 773.027237 12.883787 1.000000 1.000000
9 Feb Wed 1.553030 3.285227 13.717803 0.437500 0.500000 2.124659 0.461667 0.300000 17.549015 894.037879 14.900631 1.000000 2.000000

Total nums for each day of the month

In [48]:
data.drop(["Unnamed: 0", "VendorID", "RatecodeID", "PULocationID", "DOLocationID", "payment_type", "week"], axis = 1).groupby(["month", "day"]).sum(numeric_only = True).reset_index().iloc[0:10,].style.background_gradient()
Out[48]:
  month day passenger_count trip_distance fare_amount extra mta_tax tip_amount tolls_amount improvement_surcharge total_amount duration_secs duration_mins month_num day_num
0 Jan Mon 446 859.950000 3604.000000 81.500000 147.000000 512.510000 125.220000 88.500000 4558.730000 221510.000000 3691.833333 0 0
1 Jan Tue 567 948.920000 4317.500000 118.000000 165.000000 612.370000 136.340000 99.300000 5452.410000 282983.000000 4716.383333 0 333
2 Jan Wed 424 623.000000 2945.500000 99.500000 125.000000 424.330000 46.860000 75.000000 3718.140000 202557.000000 3375.950000 0 500
3 Jan Thu 464 800.860000 3621.000000 110.000000 138.500000 566.240000 91.440000 83.700000 4615.830000 239474.000000 3991.233333 0 837
4 Jan Fri 436 724.250000 3288.500000 106.500000 129.500000 457.750000 71.440000 78.000000 4131.690000 221915.000000 3698.583333 0 1040
5 Jan Sat 486 712.700000 3234.200000 51.500000 135.000000 465.700000 60.940000 81.300000 4028.640000 210279.000000 3504.650000 0 1355
6 Jan Sun 509 986.520000 4276.500000 61.500000 154.000000 535.950000 105.260000 92.700000 5229.810000 270467.000000 4507.783333 0 1854
7 Feb Mon 318 604.420000 3401.690000 63.500000 93.500000 530.430000 44.320000 56.700000 4195.090000 144148.000000 2402.466667 189 0
8 Feb Tue 409 598.030000 2907.500000 116.000000 128.500000 415.250000 41.320000 77.100000 3687.620000 198668.000000 3311.133333 257 257
9 Feb Wed 410 867.300000 3621.500000 115.500000 132.000000 560.910000 121.880000 79.200000 4632.940000 236026.000000 3933.766667 264 528

Average nums for each day of the week

  • On average, taxi trips take the longest on Thursdays and the shortest on Tuesdays.
  • Mondays are the best days to make money; on average, drivers make the least on Saturdays.
  • Sundays, followed by Mondays, have the longest trips on average.
  • Weekdays are the best days to get tips.
In [50]:
data.drop(["Unnamed: 0", "VendorID", "RatecodeID", "PULocationID", "DOLocationID", "payment_type", "week"], axis = 1).groupby(["day"]).mean(numeric_only = True).reset_index().iloc[0:10,].style.background_gradient()
Out[50]:
  day passenger_count trip_distance fare_amount extra mta_tax tip_amount tolls_amount improvement_surcharge total_amount duration_secs duration_mins month_num day_num
0 Mon 1.620607 3.064715 13.432852 0.375299 0.497271 1.935827 0.370051 0.299488 16.913808 925.504606 15.425077 5.486182 0.000000
1 Tue 1.619137 2.805744 13.052908 0.390557 0.497811 1.865009 0.315710 0.299625 16.424997 916.036898 15.267282 5.367730 1.000000
2 Wed 1.611209 2.827032 12.886655 0.397050 0.498378 1.916667 0.310519 0.299558 16.315773 1090.023599 18.167060 5.502950 2.000000
3 Thu 1.608172 2.922404 13.373560 0.390212 0.496620 1.900791 0.338586 0.299559 16.808322 1113.103762 18.551729 5.529101 3.000000
4 Fri 1.632288 2.817017 12.980021 0.393788 0.497656 1.842406 0.335775 0.299648 16.354744 1072.891591 17.881527 5.374158 4.000000
5 Sat 1.712504 2.812043 12.350760 0.194684 0.497178 1.638966 0.222257 0.299465 15.205049 975.732997 16.262217 5.334125 5.000000
6 Sun 1.694797 3.190644 13.178165 0.181121 0.497165 1.755060 0.300617 0.299500 16.218833 1034.213476 17.236891 5.360574 6.000000

Average nums for each month of the year

In [49]:
data.drop(["Unnamed: 0", "VendorID", "RatecodeID", "PULocationID", "DOLocationID", "payment_type", "week"], axis = 1).groupby(["month"]).mean(numeric_only = True).reset_index().iloc[0:10,].style.background_gradient()
Out[49]:
  month passenger_count trip_distance fare_amount extra mta_tax tip_amount tolls_amount improvement_surcharge total_amount duration_secs duration_mins month_num day_num
0 Jan 1.668503 2.832349 12.662594 0.314722 0.497747 1.790110 0.319229 0.299700 15.891462 825.831247 13.763854 0.000000 2.963946
1 Feb 1.645562 2.831933 13.005195 0.338609 0.497739 1.934585 0.276439 0.299661 16.358332 904.775014 15.079584 1.000000 3.140192
2 Mar 1.618350 2.883324 12.903631 0.345778 0.496828 1.810137 0.288180 0.299414 16.147335 983.993655 16.399894 2.000000 3.160078
3 Apr 1.601288 2.934656 12.647945 0.328133 0.497524 1.780248 0.298524 0.299406 15.855641 1128.545319 18.809089 3.000000 3.212977
4 May 1.643318 3.013130 13.397923 0.346001 0.498510 1.943835 0.308803 0.299702 16.805057 1037.839046 17.297317 4.000000 2.861401
5 Jun 1.664969 2.977001 13.468814 0.347505 0.495927 1.839027 0.300621 0.299542 16.761976 1157.679735 19.294662 5.000000 2.996945
6 Jul 1.706541 2.820029 12.588863 0.343253 0.496759 1.656730 0.293235 0.299293 15.685115 1081.131408 18.018857 6.000000 3.020035
7 Aug 1.680394 2.985226 12.925180 0.342807 0.498260 1.714292 0.316334 0.299826 16.101833 897.383991 14.956400 7.000000 2.803364
8 Sep 1.625144 2.941488 13.011534 0.325548 0.497982 1.806943 0.322745 0.299654 16.266655 1013.831027 16.897184 8.000000 3.085352
9 Oct 1.581154 2.890656 12.967859 0.337444 0.497287 1.848757 0.360829 0.299556 16.312694 922.344351 15.372406 9.000000 2.923532

Total duration for each hour for every day of the week

  • 8 AM on Wednesdays is tense!
  • You can see that people are still club-hopping between 12 AM and 2 AM on weekends
In [87]:
data[['month', "day", "hour",'duration_secs']].groupby(["day", "hour"]).sum(numeric_only = True).reset_index().pivot(columns = "day", index = "hour", values = "duration_secs").style.background_gradient(cmap="Reds")
Out[87]:
day Mon Tue Wed Thu Fri Sat Sun
hour              
0 38862.000000 40722.000000 52228.000000 62367.000000 94209.000000 204469.000000 147788.000000
1 22787.000000 24134.000000 22432.000000 29200.000000 46065.000000 191805.000000 184829.000000
2 19623.000000 18327.000000 15190.000000 25536.000000 37076.000000 164229.000000 150645.000000
3 19302.000000 19034.000000 8919.000000 14315.000000 12479.000000 53133.000000 82270.000000
4 6443.000000 15439.000000 9220.000000 21398.000000 20011.000000 41307.000000 51770.000000
5 29766.000000 11309.000000 23688.000000 29275.000000 27235.000000 21588.000000 21260.000000
6 62143.000000 43773.000000 58133.000000 68603.000000 62394.000000 25789.000000 16026.000000
7 98225.000000 99796.000000 130642.000000 115997.000000 186094.000000 41691.000000 109834.000000
8 134528.000000 161328.000000 314016.000000 164927.000000 153944.000000 42419.000000 44621.000000
9 144429.000000 150299.000000 191268.000000 157763.000000 249017.000000 93568.000000 144867.000000
10 134091.000000 139957.000000 155439.000000 207698.000000 241140.000000 84865.000000 93034.000000
11 147551.000000 162668.000000 240542.000000 284291.000000 165428.000000 194987.000000 104397.000000
12 119534.000000 127735.000000 152722.000000 181283.000000 170374.000000 134771.000000 153530.000000
13 126219.000000 214413.000000 173703.000000 159368.000000 128368.000000 141820.000000 245732.000000
14 246133.000000 208297.000000 138133.000000 199687.000000 163898.000000 253579.000000 167933.000000
15 149031.000000 181703.000000 201264.000000 178007.000000 194256.000000 146213.000000 231953.000000
16 146979.000000 236633.000000 156146.000000 195261.000000 170941.000000 146870.000000 164651.000000
17 118572.000000 168109.000000 194816.000000 274634.000000 204962.000000 163509.000000 260220.000000
18 194150.000000 239935.000000 296015.000000 197051.000000 188136.000000 192176.000000 146597.000000
19 158315.000000 156002.000000 260556.000000 198013.000000 269069.000000 177685.000000 220487.000000
20 129501.000000 148751.000000 250293.000000 256430.000000 309608.000000 232319.000000 111856.000000
21 295615.000000 138778.000000 281148.000000 189172.000000 172752.000000 220166.000000 85711.000000
22 99699.000000 129134.000000 257161.000000 162954.000000 246186.000000 159164.000000 94739.000000
23 71156.000000 93210.000000 111506.000000 413549.000000 148137.000000 157171.000000 65822.000000
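The `groupby(...).sum().reset_index().pivot(...)` chain above can be collapsed into a single `pd.pivot_table` call. A sketch on a tiny synthetic frame:

```python
import pandas as pd

df = pd.DataFrame({
    "day":  ["Mon", "Mon", "Tue", "Mon"],
    "hour": [8, 8, 8, 9],
    "duration_secs": [100.0, 50.0, 200.0, 75.0],
})

# pivot_table groups, aggregates, and pivots in one step
table = pd.pivot_table(df, index="hour", columns="day",
                       values="duration_secs", aggfunc="sum")
```

Missing day/hour combinations come out as `NaN`, just as in the chained version; `fill_value=0` would zero them instead.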

Bunch of heatmaps

In [91]:
plt.figure(figsize=(11, 11))
sb.heatmap(data[['month', "day", "hour",'duration_mins']].groupby(["day", "hour"]).sum(numeric_only = True).reset_index().pivot(columns = "day", index = "hour", values = "duration_mins"))
plt.title("Total taxi trip duration (in minutes) for each hour of every day of the week")
Out[91]:
Text(0.5, 1.0, 'Total taxi trip duration (in minutes) for each hour of every day of the week')
In [89]:
from matplotlib import pyplot as plt

plt.figure(figsize=(11, 7))
sb.heatmap(data[['month', "day", "time_of_day",'duration_mins']].groupby(["day", "time_of_day"]).sum(numeric_only = True).reset_index().pivot(columns = "day", index = "time_of_day", values = "duration_mins"))
plt.title("Total taxi trip duration (in minutes) for each 4-hour interval every day")
Out[89]:
Text(0.5, 1.0, 'Total taxi trip duration (in minutes) for each 4-hour interval every day')
In [93]:
data.columns
Out[93]:
Index(['Unnamed: 0', 'VendorID', 'tpep_pickup_datetime',
       'tpep_dropoff_datetime', 'passenger_count', 'trip_distance',
       'RatecodeID', 'store_and_fwd_flag', 'PULocationID', 'DOLocationID',
       'payment_type', 'fare_amount', 'extra', 'mta_tax', 'tip_amount',
       'tolls_amount', 'improvement_surcharge', 'total_amount',
       'duration_secs', 'duration_mins', 'week', 'day', 'month', 'month_num',
       'day_num', 'tpep_pickup_time', 'time_of_day', 'period_of_day', 'hour'],
      dtype='object')
In [98]:
# plot line graph on axis #1
ax1 = sb.lineplot(
    x='tpep_dropoff_datetime', 
    y='duration_mins', 
    data=data.groupby("tpep_dropoff_datetime").sum(numeric_only = True).reset_index(), 
    sort=False, 
    color='blue'
)
ax1.set_ylabel('duration')
#ax1.set_ylim(0, 25)
ax1.legend(['duration'], loc="upper left")
# set up the 2nd axis
ax2 = ax1.twinx()
# plot bar graph on axis #2
sb.lineplot(
    x = 'tpep_dropoff_datetime', 
    y = 'fare_amount', 
    data = data.groupby("tpep_dropoff_datetime").sum(numeric_only = True).reset_index(), 
    color ='orange', 
    alpha = 0.5, 
    ax = ax2       # Pre-existing axes for the plot
)
ax2.grid(visible = False) # turn off grid #2
ax2.set_ylabel('Fare Amount')
#ax2.set_ylim(0, 90)
ax2.legend(['Fare Amount'], loc = "upper right")
plt.show()

Distribution of trip durations (in seconds) per month

Excludes outliers

In [51]:
plt.figure(figsize = (9,6))
sb.boxplot(data = data[data.duration_secs >= 0], y = "duration_secs", x = "month", showfliers = False)
Out[51]:
<AxesSubplot: xlabel='month', ylabel='duration_secs'>
In [53]:
plt.figure(figsize = (9,6))
sb.lineplot(data = data[data.duration_secs >= 0], y = "duration_secs", x = "month")
Out[53]:
<AxesSubplot: xlabel='month', ylabel='duration_secs'>

Encode Variables¶

I wrote these functions to make label encoding and one-hot encoding easy, but I ended up not encoding anything, so they went unused. I'll keep them around for the future =D

In [70]:
def one_hot_enc(dataset: pd.DataFrame, variables):
    '''
    def one_hot_enc(dataset, variables)

    Perform one-hot encoding on the provided dataset. Avoids having to manually one-hot encode each variable.

    # Parameters:
    dataset: pandas DataFrame
        The full dataset to be updated with encoded variables
    variables: single label (str) or list-like
        The variables to be replaced with one-hot encoded versions

    # Returns:
    DataFrame with the one-hot encoded variables

    -> DataFrame'''

    output = pd.DataFrame()

    if isinstance(dataset, pd.DataFrame) and isinstance(variables, str):
        return pd.get_dummies(dataset[variables])

    elif isinstance(dataset, pd.DataFrame) and isinstance(variables, (list, tuple)):
        for variable in variables:
            if output.empty:
                output = pd.get_dummies(dataset[variable])
            else:
                output = pd.concat([output, pd.get_dummies(dataset[variable])], axis = 1)
    else:
        raise TypeError("Expected a pandas DataFrame and a string or list-like object containing column names")

    return output


def label_enc(dataset: pd.DataFrame, variables):
    '''
    def label_enc(df, labels)

    Perform label encoding on the provided dataset

    # Parameters:
    ------------
    dataset: pandas DataFrame
        The full dataset to be updated with encoded variables
    variables: single label or list-like
        The variables to be replaced with label encoded versions

    # Returns:
    -----------
    Dataset: pandas DataFrame
        Dataset with the label encoded variables
    '''
    output = pd.DataFrame()

    if isinstance(dataset, pd.DataFrame) and isinstance(variables, str):
        return dataset[variables].cat.codes

    elif isinstance(dataset, pd.DataFrame) and isinstance(variables, (list, tuple)):

        for variable in variables:
            if output.empty:
                output = dataset[variable].cat.codes
            else:
                output = pd.concat([output, dataset[variable].cat.codes], axis = 1)

    else:
        raise TypeError("You should use a pandas DataFrame with a string or list-like object containing column names")

    return output
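For reference, a minimal usage sketch of the pandas primitives these helpers wrap (`pd.get_dummies` and `Series.cat.codes`), on a hypothetical toy frame rather than the taxi data:

```python
import pandas as pd

toy = pd.DataFrame({
    "color": pd.Categorical(["red", "blue", "red"]),
    "size": pd.Categorical(["S", "M", "S"]),
})

# one-hot encoding: each category becomes its own 0/1 indicator column
dummies = pd.get_dummies(toy["color"])
print(sorted(dummies.columns))   # ['blue', 'red']

# label encoding: each category is mapped to an integer code
codes = toy["size"].cat.codes
print(codes.tolist())            # [1, 0, 1]
```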

Actual EDA¶

Client request: Provide a summary of the column dtypes, non-null value counts, relevant and irrelevant columns, along with anything else code-related worth showing in the notebook.

In [13]:
data.describe()
Out[13]:
VendorID passenger_count trip_distance RatecodeID PULocationID DOLocationID payment_type fare_amount extra mta_tax tip_amount tolls_amount improvement_surcharge total_amount
count 22699.000000 22699.000000 22699.000000 22699.000000 22699.000000 22699.000000 22699.000000 22699.000000 22699.000000 22699.000000 22699.000000 22699.000000 22699.000000 22699.000000
mean 1.556236 1.642319 2.913313 1.043394 162.412353 161.527997 1.336887 13.026629 0.333275 0.497445 1.835781 0.312542 0.299551 16.310502
std 0.496838 1.285231 3.653171 0.708391 66.633373 70.139691 0.496211 13.243791 0.463097 0.039465 2.800626 1.399212 0.015673 16.097295
min 1.000000 0.000000 0.000000 1.000000 1.000000 1.000000 1.000000 -120.000000 -1.000000 -0.500000 0.000000 0.000000 -0.300000 -120.300000
25% 1.000000 1.000000 0.990000 1.000000 114.000000 112.000000 1.000000 6.500000 0.000000 0.500000 0.000000 0.000000 0.300000 8.750000
50% 2.000000 1.000000 1.610000 1.000000 162.000000 162.000000 1.000000 9.500000 0.000000 0.500000 1.350000 0.000000 0.300000 11.800000
75% 2.000000 2.000000 3.060000 1.000000 233.000000 233.000000 2.000000 14.500000 0.500000 0.500000 2.450000 0.000000 0.300000 17.800000
max 2.000000 6.000000 33.960000 99.000000 265.000000 265.000000 4.000000 999.990000 4.500000 0.500000 200.000000 19.100000 0.300000 1200.290000
In [15]:
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 22699 entries, 24870114 to 17208911
Data columns (total 17 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   VendorID               22699 non-null  int64  
 1   tpep_pickup_datetime   22699 non-null  object 
 2   tpep_dropoff_datetime  22699 non-null  object 
 3   passenger_count        22699 non-null  int64  
 4   trip_distance          22699 non-null  float64
 5   RatecodeID             22699 non-null  int64  
 6   store_and_fwd_flag     22699 non-null  object 
 7   PULocationID           22699 non-null  int64  
 8   DOLocationID           22699 non-null  int64  
 9   payment_type           22699 non-null  int64  
 10  fare_amount            22699 non-null  float64
 11  extra                  22699 non-null  float64
 12  mta_tax                22699 non-null  float64
 13  tip_amount             22699 non-null  float64
 14  tolls_amount           22699 non-null  float64
 15  improvement_surcharge  22699 non-null  float64
 16  total_amount           22699 non-null  float64
dtypes: float64(8), int64(6), object(3)
memory usage: 3.1+ MB
In [17]:
data.isnull().sum()
Out[17]:
VendorID                 0
tpep_pickup_datetime     0
tpep_dropoff_datetime    0
passenger_count          0
trip_distance            0
RatecodeID               0
store_and_fwd_flag       0
PULocationID             0
DOLocationID             0
payment_type             0
fare_amount              0
extra                    0
mta_tax                  0
tip_amount               0
tolls_amount             0
improvement_surcharge    0
total_amount             0
dtype: int64

Money variables - cost of taxi rides

In [24]:
data.groupby(["payment_type"])[["tip_amount", "tolls_amount", "total_amount"]].agg(["mean", "median", "min", "max"])
Out[24]:
tip_amount tolls_amount total_amount
mean median min max mean median min max mean median min max
payment_type
1 2.7298 2.0 0.0 200.0 0.357659 0.0 0.0 19.10 17.663577 12.95 0.0 1200.29
2 0.0000 0.0 0.0 0.0 0.214441 0.0 0.0 18.28 13.545821 9.80 0.0 450.30
3 0.0000 0.0 0.0 0.0 0.388595 0.0 0.0 12.50 13.579669 8.30 -5.3 78.30
4 0.0000 0.0 0.0 0.0 0.638261 0.0 0.0 11.52 11.238261 9.30 -120.3 64.32

Total amount

Money should not be negative, but for some reason the total amount above has negative values. Let's investigate

In [140]:
f"There are {data[data.total_amount < 0].shape[0]} negative values for the total amount paid"
Out[140]:
'There are 14 negative values for the total amount paid'
In [146]:
data[data.total_amount < 0].groupby("payment_cats").total_amount.count()

# all the negative amounts come from disputed or uncharged transactions, wonder what happened there
Out[146]:
payment_cats
Dispute      7
No charge    7
Name: total_amount, dtype: int64
In [28]:
# Let's see these rides!

data[data.total_amount < 0]
Out[28]:
VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count trip_distance RatecodeID store_and_fwd_flag PULocationID DOLocationID payment_type fare_amount extra mta_tax tip_amount tolls_amount improvement_surcharge total_amount
105454287 2 12/13/2017 2:02:39 AM 12/13/2017 2:03:08 AM 6 0.12 1 N 161 161 3 -2.5 -0.5 -0.5 0.0 0.0 -0.3 -3.8
57337183 2 07/05/2017 11:02:23 AM 07/05/2017 11:03:00 AM 1 0.04 1 N 79 79 3 -2.5 0.0 -0.5 0.0 0.0 -0.3 -3.3
97329905 2 11/16/2017 8:13:30 PM 11/16/2017 8:14:50 PM 2 0.06 1 N 237 237 4 -3.0 -0.5 -0.5 0.0 0.0 -0.3 -4.3
28459983 2 04/06/2017 12:50:26 PM 04/06/2017 12:52:39 PM 1 0.25 1 N 90 68 3 -3.5 0.0 -0.5 0.0 0.0 -0.3 -4.3
833948 2 01/03/2017 8:15:23 PM 01/03/2017 8:15:39 PM 1 0.02 1 N 170 170 3 -2.5 -0.5 -0.5 0.0 0.0 -0.3 -3.8
91187947 2 10/28/2017 8:39:36 PM 10/28/2017 8:41:59 PM 1 0.41 1 N 236 237 3 -3.5 -0.5 -0.5 0.0 0.0 -0.3 -4.8
55302347 2 06/05/2017 5:34:25 PM 06/05/2017 5:36:29 PM 2 0.00 1 N 238 238 4 -2.5 -1.0 -0.5 0.0 0.0 -0.3 -4.3
58395501 2 07/09/2017 7:20:59 AM 07/09/2017 7:23:50 AM 1 0.64 1 N 50 48 3 -4.5 0.0 -0.5 0.0 0.0 -0.3 -5.3
29059760 2 04/08/2017 12:00:16 AM 04/08/2017 11:15:57 PM 1 0.17 5 N 138 138 4 -120.0 0.0 0.0 0.0 0.0 -0.3 -120.3
109276092 2 12/24/2017 10:37:58 PM 12/24/2017 10:41:08 PM 5 0.40 1 N 164 161 4 -4.0 -0.5 -0.5 0.0 0.0 -0.3 -5.3
24690146 2 03/24/2017 7:31:13 PM 03/24/2017 7:34:49 PM 1 0.46 1 N 87 45 4 -4.0 -1.0 -0.5 0.0 0.0 -0.3 -5.8
43859760 2 05/22/2017 3:51:20 PM 05/22/2017 3:52:22 PM 1 0.10 1 N 230 163 3 -3.0 0.0 -0.5 0.0 0.0 -0.3 -3.8
75926915 2 09/09/2017 10:59:51 PM 09/09/2017 11:02:06 PM 1 0.24 1 N 116 116 4 -3.5 -0.5 -0.5 0.0 0.0 -0.3 -4.8
14668209 2 02/24/2017 12:38:17 AM 02/24/2017 12:42:05 AM 1 0.70 1 N 65 25 4 -4.5 -0.5 -0.5 0.0 0.0 -0.3 -5.8
In [151]:
data[(data.total_amount < 0) & ((data.PULocationID == data.DOLocationID) | (data.trip_distance == 0))].groupby("payment_cats").total_amount.count()

# seven of these uncharged or disputed transactions ended in the same area they started
# I'll sort these out when tackling outliers later
Out[151]:
payment_cats
Dispute      4
No charge    3
Name: total_amount, dtype: int64

More visuals¶

At this point I'm just trying to figure out whether there are any patterns we can identify visually. Enjoy the heatmaps! :)

More heatmaps¶

In [69]:
data[['month', "day", 'duration_secs']].groupby(['month', "day"]).sum().reset_index()
Out[69]:
month day duration_secs
0 Jan Mon 221510.0
1 Jan Tue 282983.0
2 Jan Wed 202557.0
3 Jan Thu 239474.0
4 Jan Fri 221915.0
... ... ... ...
79 Dec Wed 596897.0
80 Dec Thu 533153.0
81 Dec Fri 344507.0
82 Dec Sat 236396.0
83 Dec Sun 293622.0

84 rows × 3 columns

In [81]:
plt.figure(figsize=(11, 7))
plt.title("Total durations of taxi trips (in seconds) for every day of each month")
sb.heatmap(data[['month', "day", 'duration_secs']].groupby(['month', "day"]).sum().reset_index().pivot(columns = "month", index = "day", values = "duration_secs"))
Out[81]:
<AxesSubplot: title={'center': 'Total durations of taxi trips (in seconds) for every day of each month'}, xlabel='month', ylabel='day'>
In [80]:
plt.figure(figsize=(11, 7))
sb.heatmap(data[['month', "day", 'duration_secs']].groupby(['month', "day"]).mean().reset_index().pivot(columns = "month", index = "day", values = "duration_secs"))
plt.title("Average taxi trip duration (in seconds) for every day of each month")
Out[80]:
Text(0.5, 1.0, 'Average taxi trip duration (in seconds) for every day of each month')

Barplots¶

In [87]:
plt.figure(figsize=(12, 7))
plt.title("Average aggregated duration (in seconds) of taxi rides for each day of the month")
sb.barplot(data = data[['month', "day", 'duration_secs']].groupby(['month', "day"]).mean().reset_index(), x = "month", y = "duration_secs", hue = "day")
Out[87]:
<AxesSubplot: title={'center': 'Average aggregated duration (in seconds) of taxi rides for each day of the month'}, xlabel='month', ylabel='duration_secs'>
In [88]:
plt.figure(figsize=(12, 7))
plt.title("Total aggregated duration (in seconds) of taxi rides for each day of the month")
sb.barplot(data = data[['month', "day", 'duration_secs']].groupby(['month', "day"]).sum().reset_index(), x = "month", y = "duration_secs", hue = "day")
Out[88]:
<AxesSubplot: title={'center': 'Total aggregated duration (in seconds) of taxi rides for each day of the month'}, xlabel='month', ylabel='duration_secs'>

Boxplots¶

In [94]:
plt.figure(figsize=(12, 7))
plt.title("Distribution of trip duration (in seconds) of taxi rides for each month")
sb.boxplot(data = data[['month', "day", 'duration_secs']], x = "month", y = "duration_secs", showfliers = False)
Out[94]:
<AxesSubplot: title={'center': 'Distribution of trip duration (in seconds) of taxi rides for each month'}, xlabel='month', ylabel='duration_secs'>
In [97]:
plt.figure(figsize=(12, 7))
plt.title("Distribution of trip duration (in seconds) of taxi rides for each day of the week")
sb.boxplot(data = data[['month', "day", 'duration_secs']], x = "day", y = "duration_secs", showfliers = False)
Out[97]:
<AxesSubplot: title={'center': 'Distribution of trip duration (in seconds) of taxi rides for each day of the week'}, xlabel='day', ylabel='duration_secs'>

Even more heatmaps¶

In [108]:
plt.figure(figsize=(11, 7))
plt.title("Total number of taxi passengers for every day of each month")
sb.heatmap(data[['month', "day", 'passenger_count']].groupby(['month', "day"]).sum().reset_index().pivot(columns = "month", index = "day", values = "passenger_count"))
Out[108]:
<AxesSubplot: title={'center': 'Total number of taxi passengers for every day of each month'}, xlabel='month', ylabel='day'>
In [113]:
plt.figure(figsize=(11, 7))
plt.title("Average number of taxi passengers for every day of each month")
sb.heatmap(data[['month', "day", 'passenger_count']].groupby(['month', "day"]).mean().reset_index().pivot(columns = "month", index = "day", values = "passenger_count"))
Out[113]:
<AxesSubplot: title={'center': 'Average number of taxi passengers for every day of each month'}, xlabel='month', ylabel='day'>

More barplots¶

In [114]:
plt.figure(figsize=(11, 7))
plt.title("Average number of taxi passengers for every day of each month")
plt.ylim(1.5, 1.75)
sb.barplot(data[['month', "day", 'passenger_count']].groupby(['month', "day"]).mean().reset_index().pivot(columns = "month", index = "day", values = "passenger_count"))
Out[114]:
<AxesSubplot: title={'center': 'Average number of taxi passengers for every day of each month'}, xlabel='month'>
In [117]:
plt.figure(figsize=(11, 7))
plt.title("Average number of taxi passengers for each day of the week")
plt.ylim(1.5, 1.75)
sb.barplot(data[['month', "day", 'passenger_count']].groupby(['month', "day"]).mean().reset_index(), x = "day", y = "passenger_count")
Out[117]:
<AxesSubplot: title={'center': 'Average number of taxi passengers for each day of the week'}, xlabel='day', ylabel='passenger_count'>

RatecodeID¶

  • presumably RatecodeID 1 is the standard (cheapest) rate, since almost everyone uses it
  • RatecodeID 99 might be an input error
  • RatecodeID 5 charges nearly 7 times as much as code 1 on average
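A quick back-of-envelope check of that last bullet, using the per-RatecodeID mean fares computed in the next cell (hard-coded here for illustration):

```python
# sanity check of the "nearly 7x" claim, using the group means from the
# RatecodeID table (values copied by hand, so treat them as illustrative)
mean_rate_1 = 11.807340   # average fare, RatecodeID 1
mean_rate_5 = 78.570000   # average fare, RatecodeID 5

ratio = mean_rate_5 / mean_rate_1
print(f"{ratio:.2f}x")    # ≈ 6.65x
```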
In [156]:
data[['RatecodeID', "fare_amount"]].groupby('RatecodeID').agg(["count", 'mean'])["fare_amount"].reset_index().style.background_gradient()
Out[156]:
  RatecodeID count mean
0 1 22070 11.807340
1 2 513 52.000000
2 3 39 61.961538
3 4 8 73.875000
4 5 68 78.570000
5 99 1 77.200000
In [157]:
plt.title("Number of taxi rides charged each RatecodeID")
sb.barplot(data[['RatecodeID', "fare_amount"]].groupby('RatecodeID').agg(["count", 'mean'])["fare_amount"].reset_index(), x = "RatecodeID", y = "count")
Out[157]:
<AxesSubplot: title={'center': 'Number of taxi rides charged each RatecodeID'}, xlabel='RatecodeID', ylabel='count'>
In [158]:
plt.title("Average fare amount charged for each RatecodeID")
sb.barplot(data[['RatecodeID', "fare_amount"]].groupby('RatecodeID').agg(["count", 'mean'])["fare_amount"].reset_index(), x = "RatecodeID", y = "mean")
Out[158]:
<AxesSubplot: title={'center': 'Average fare amount charged for each RatecodeID'}, xlabel='RatecodeID', ylabel='mean'>

Visualize Statistical Relationships¶

Correlation Matrix¶

The functions below attempt to replicate R-style correlation matrix plots (pairwise scatterplots with correlation coefficients and significance stars)

In [160]:
cols = ['trip_distance', 'fare_amount', 'tip_amount','tolls_amount', 'total_amount']
In [36]:
def corrdot(*args, **kwargs):
    corr_r = args[0].corr(args[1], 'pearson')
    corr_text = round(corr_r, 2)
    ax = plt.gca()
    font_size = abs(corr_r) * 80 + 5
    ax.annotate(corr_text, [.5, .5,],  xycoords = "axes fraction",
                ha ='center', va ='center', fontsize = font_size)

def corrfunc(x, y, **kws):
    r, p = stats.pearsonr(x, y)
    p_stars = ''
    if p <= 0.05:
        p_stars = '*'
    if p <= 0.01:
        p_stars = '**'
    if p <= 0.001:
        p_stars = '***'
    ax = plt.gca()
    ax.annotate(p_stars, xy = (0.65, 0.6), xycoords = ax.transAxes,
                color = 'red', fontsize = 70)
In [162]:
sb.set(style='white', font_scale=1.6)
np.seterr(invalid = 'ignore')
g = sb.PairGrid(data[cols], aspect=1.5, diag_sharey=False, despine=False)
g.map_lower(sb.regplot, lowess = True, ci=False,
            line_kws={'color': 'red', 'lw': 1},
            scatter_kws={'color': 'black', 's': 20})
g.map_diag(sb.histplot, color = 'black', edgecolor = 'k', facecolor ='grey',
           kde = True, kde_kws = {'cut': 0.7}, line_kws = {"color": 'red'})
g.map_diag(sb.rugplot, color = 'black')
g.map_upper(corrdot)
g.map_upper(corrfunc)
g.fig.subplots_adjust(wspace = 0, hspace = 0)

# Remove axis labels
for ax in g.axes.flatten():
    ax.set_ylabel('')
    ax.set_xlabel('')

# Add titles to the diagonal axes/subplots
for ax, col in zip(np.diag(g.axes), data[cols].columns):
    ax.set_title(col, y=0.82, fontsize=26)

Data Profile Report¶

Comprehensive summary of the data, variables and their interactions

In [81]:
pr = ProfileReport(data)
In [82]:
# Profile Report would render here; uncomment below to display it.
# It slows down the file, so I cleared the output temporarily
# I only profile the original variables... no reason really

#pr

In [83]:
pr.to_file("taxi duration.html")
In [9]:
# The store_and_fwd_flag var seems to have no predictive power, and it is extremely unbalanced

data.drop(["Unnamed: 0", "VendorID", "RatecodeID", "PULocationID", "DOLocationID", "payment_type", "week"], axis = 1).groupby("store_and_fwd_flag").describe()
Out[9]:
passenger_count trip_distance ... day_num hour
count mean std min 25% 50% 75% max count mean ... 75% max count mean std min 25% 50% 75% max
store_and_fwd_flag
N 22600.0 1.644115 1.287143 0.0 1.0 1.0 2.0 6.0 22600.0 2.911544 ... 5.0 6.0 22600.0 13.725531 6.228109 0.0 9.0 14.0 19.0 23.0
Y 99.0 1.232323 0.603194 1.0 1.0 1.0 1.0 4.0 99.0 3.317172 ... 4.0 6.0 99.0 13.959596 5.405815 0.0 11.0 15.0 19.0 23.0

2 rows × 112 columns

Trip duration vs distance¶

In [15]:
sb.scatterplot(data[["trip_distance", "duration_secs" ]], x = "trip_distance", y = "duration_secs")
Out[15]:
<AxesSubplot: xlabel='trip_distance', ylabel='duration_secs'>
In [16]:
data["duration_secs"].corr(data["trip_distance"])
Out[16]:
0.15360834125205425

There seems to be very little correlation between the two variables. However, there are two distinct clusters, so let's investigate!

We'll start with the relationship for trips that take longer than 80,000 seconds (>22 hours)

In [12]:
sb.scatterplot(data[data.duration_secs>80000][["trip_distance", "duration_secs" ]], x = "trip_distance", y = "duration_secs")
Out[12]:
<AxesSubplot: xlabel='trip_distance', ylabel='duration_secs'>
In [14]:
data[data.duration_secs>80000]["duration_secs"].corr(data[data.duration_secs>80000]["trip_distance"])
Out[14]:
0.26355866066221384

There is a weak positive relationship between trip duration and distance for trips longer than 22 hours. Maybe these people just drive a little and then relax with their drivers, or they got murdered, or the drivers forgot to switch off the meter... the possibilities are endless.

Let's see what happens when the trips last less than 20,000 seconds (~ <5 hours 30 minutes)

In [13]:
sb.scatterplot(data[data.duration_secs<20000][["trip_distance", "duration_secs" ]], x = "trip_distance", y = "duration_secs")
Out[13]:
<AxesSubplot: xlabel='trip_distance', ylabel='duration_secs'>
In [15]:
data[data.duration_secs<20000]["duration_secs"].corr(data[data.duration_secs<20000]["trip_distance"])
Out[15]:
0.776168297801309

Well, what do you know: suddenly a strong positive relationship. This is what I expected from the overall dataset.

Trip Duration¶

The distribution of our variable of interest is heavily positively skewed; I'll use a natural log transformation below to fix it

In [10]:
sb.histplot(data.duration_secs)
Out[10]:
<AxesSubplot: xlabel='duration_secs', ylabel='Count'>
In [55]:
data[data.duration_secs < 0][['Unnamed: 0', 'VendorID', 'tpep_pickup_datetime','tpep_dropoff_datetime', 'passenger_count', 'trip_distance', 'payment_type', 'total_amount', 'fare_amount']]

# This observation has a negative duration; its dropoff timestamp comes before its pickup timestamp;
# Below I remove that observation, along with observations with a duration of 0, to be able to use a log transformation
Out[55]:
Unnamed: 0 VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count trip_distance payment_type total_amount fare_amount
9356 93542707 1 2017-11-05 01:23:08 2017-11-05 01:06:09 1 5.7 3 29.3 28.0
In [100]:
data.drop(axis = 0 , index = data[data.duration_secs <= 0].index, inplace = True, errors = 'ignore')

# in hindsight I should have also removed the 14 observations with negative total amounts... they are not worth this trouble :-)
In [101]:
# Creating z-scores to try to pinpoint outliers

data["duration_secs_log"] = np.log(data.duration_secs)
data["duration_secs_log_z"] = stats.zscore(data["duration_secs_log"])

Normal is the new normal

In [20]:
sb.histplot(data.duration_secs_log)

# the log-transformed duration looks approximately normal, with some outliers
Out[20]:
<AxesSubplot: xlabel='duration_secs_log', ylabel='Count'>
In [82]:
# sample some outliers, using a z-score of 3 as the boundary for outliers

data[data["duration_secs_log_z"] >= 3].sample(5)
Out[82]:
Unnamed: 0 VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count trip_distance RatecodeID store_and_fwd_flag PULocationID DOLocationID ... day month tpep_pickup_time month_num day_num hour period_of_day time_of_day duration_secs_log duration_secs_log_z
9208 79635338 2 2017-09-22 09:20:53 2017-09-23 09:04:02 1 1.82 1 N 158 100 ... Fri Sep 09:20:53 8 4 9 My people 09:00 - 13:00 11.354973 5.725934
15579 41838754 2 2017-05-10 18:53:53 2017-05-11 18:53:02 5 0.74 1 N 161 162 ... Wed May 18:53:53 4 2 18 night travellers 17:00 - 21:00 11.366153 5.739057
1355 31453899 2 2017-04-17 21:26:49 2017-04-18 20:46:13 6 4.09 1 N 100 13 ... Mon Apr 21:26:49 3 0 21 Late night 21:00 - 01:00 11.338143 5.706181
21511 56793229 2 2017-07-02 15:45:27 2017-07-03 15:41:54 1 4.98 1 N 87 49 ... Sun Jul 15:45:27 6 6 15 Afternoon rush 13:00 - 17:00 11.364275 5.736853
11672 22404356 2 2017-03-18 14:58:31 2017-03-19 14:31:35 3 3.32 1 N 230 144 ... Sat Mar 14:58:31 2 5 14 Afternoon rush 13:00 - 17:00 11.347862 5.717588

5 rows × 31 columns

In [23]:
# let's check out the outliers on the left side

sb.histplot(data[data.duration_secs_log_z <= -3].duration_secs_log)
Out[23]:
<AxesSubplot: xlabel='duration_secs_log', ylabel='Count'>
In [25]:
# and on the right

sb.histplot(data[data.duration_secs_log_z >= 3].duration_secs_log)
Out[25]:
<AxesSubplot: xlabel='duration_secs_log', ylabel='Count'>
In [26]:
sb.scatterplot(data[(data.duration_secs_log_z < 3) & (data.duration_secs_log_z > -3)], x = 'duration_mins', y = 'trip_distance')

# The relationship between the trip duration and distance is wild, 
# looks like the residual of a horrible linear regression model... #heteroskedasticity #crazyVarianceThings
Out[26]:
<AxesSubplot: xlabel='duration_mins', ylabel='trip_distance'>

I'm curious whether our newly normal dataset follows the empirical rule for normal distributions

In [30]:
# 68%

f"About {((data.duration_secs_log_z >-1) & (data.duration_secs_log_z < 1)).mean():.2%} of the data falls within 1 standard deviation of the mean, compared to 68% for a normal distribution"
Out[30]:
'About 74.27% of the data falls within 1 standard deviation of the mean, compared to 68% for a normal distribution'
In [28]:
# 95%

f"{((data.duration_secs_log_z >-2) & (data.duration_secs_log_z < 2)).mean():.2%} of the data falls within 2 standard deviations of the mean, compared to 95% for a normal distribution"
Out[28]:
'96.79% of the data falls within 2 standard deviations of the mean, compared to 95% for a normal distribution'
In [29]:
# 99.7%

f"{((data.duration_secs_log_z >-3) & (data.duration_secs_log_z < 3)).mean():.2%} of the data falls within 3 standard deviations of the mean, compared to 99.7% for a normal distribution"
Out[29]:
'99.09% of the data falls within 3 standard deviations of the mean, compared to 99.7% for a normal distribution'

I mean....barely, but close enough
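For comparison, here's what the same three checks look like on a genuinely normal sample (synthetic standard-normal draws, not the taxi data):

```python
import numpy as np

# synthetic iid standard normal draws, for comparison with the taxi data above
rng = np.random.default_rng(0)
z = rng.standard_normal(100_000)

# share of draws within 1, 2, and 3 standard deviations of the mean
for k in (1, 2, 3):
    share = ((z > -k) & (z < k)).mean()
    print(f"within {k} sd: {share:.2%}")   # ~68%, ~95%, ~99.7%
```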

Central Limit Theorem¶

Just playing around with the CLT to see if y'all aren't lying to us!

This treats variables in the dataset as the population and draws a number of samples with sample size ≥ 30 to see if the theorem holds

Duration in seconds¶

In [104]:
data.duration_secs.median()
Out[104]:
671.0
In [103]:
data.duration_secs.mean()
Out[103]:
1022.0872441778405
In [31]:
data.duration_secs_log.median()
Out[31]:
6.508769136971682
In [32]:
data.duration_secs_log.mean()
Out[32]:
6.476605878772017
In [33]:
sb.boxplot(data.duration_secs_log)
Out[33]:
<AxesSubplot: >
In [161]:
number_of_samples = 1500
sample_size = 50

dic_list = []

for x in range(number_of_samples):
    samp = data.duration_secs.sample(sample_size)
    item = {"mean":samp.mean(), "std":samp.std()}
    dic_list.append(item)

sample_dist = pd.DataFrame(dic_list)

sb.histplot(sample_dist["mean"])

print(f"Mean: { sample_dist['mean'].mean():.4f} \nStandard Error: {sample_dist['mean'].std():.4f}")
Mean: 1006.1934 
Standard Error: 497.4321

I mean... regardless of the sample size or number of samples, there are always at least two (far-apart) modes, sometimes up to four. There should be just one mode and an approximately normal distribution.

In [162]:
sample_dist["lower"] = sample_dist['mean'] - (1.96* (sample_dist['std']/math.sqrt(sample_size)))
sample_dist["upper"] = sample_dist['mean'] + (1.96* (sample_dist['std']/math.sqrt(sample_size)))

sample_dist.head()
In [168]:
# checks how many of the above sample CIs include the population mean

perc = ((sample_dist["lower"] <= data.duration_secs.mean()) & (sample_dist["upper"] >= data.duration_secs.mean())).mean()

f"95% of confidence intervals should contain the population mean. In this instance, about {perc:.2%} of the {number_of_samples} confidence intervals contain the population mean."
Out[168]:
'95% of confidence intervals should contain the population mean. In this instance, about 59.33% of the 1500 confidence intervals contain the population mean.'
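The shortfall is largely the skew and extreme outliers at work: running the same procedure against a symmetric, well-behaved population lands close to the nominal 95%. A sketch with synthetic data (the population parameters here are made up for illustration):

```python
import math
import numpy as np

# symmetric synthetic "population" (illustrative parameters, not the taxi data)
rng = np.random.default_rng(28)
pop = rng.normal(loc=1000, scale=500, size=20_000)
n, reps = 50, 1500

hits = 0
for _ in range(reps):
    samp = rng.choice(pop, size=n)
    half = 1.96 * samp.std(ddof=1) / math.sqrt(n)   # 95% CI half-width
    hits += samp.mean() - half <= pop.mean() <= samp.mean() + half

print(f"coverage: {hits / reps:.2%}")   # close to the nominal 95%
```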

Trip Distance¶

In [146]:
sb.histplot(data['trip_distance'])
Out[146]:
<AxesSubplot: xlabel='trip_distance', ylabel='Count'>
In [148]:
print(f"Mean: {data['trip_distance'].mean():.4f} \nStandard Deviation: {data['trip_distance'].std():.4f}")
Mean: 2.9165 
Standard Deviation: 3.6540
In [170]:
number_of_samples = 1500
sample_size = 30

dic_list = []

for x in range(number_of_samples):
    samp = data.trip_distance.sample(sample_size)
    item = {"mean":samp.mean(), "std":samp.std()}
    dic_list.append(item)

sample_dist = pd.DataFrame(dic_list)

sb.histplot(sample_dist["mean"])

print(f"Mean: { sample_dist['mean'].mean():.4f} \nStandard Error: {sample_dist['mean'].std():.4f}")
Mean: 2.9119 
Standard Error: 0.6689
In [173]:
sample_dist["lower"] = sample_dist['mean'] - (1.96* (sample_dist['std']/math.sqrt(sample_size)))
sample_dist["upper"] = sample_dist['mean'] + (1.96* (sample_dist['std']/math.sqrt(sample_size)))

perc = ((sample_dist["lower"] <= data.trip_distance.mean()) & (sample_dist["upper"] >= data.trip_distance.mean())).mean()

f"95% of confidence intervals should contain the population mean. In this instance, about {perc:.2%} of the {number_of_samples} confidence intervals contain the population mean."
Out[173]:
'95% of confidence intervals should contain the population mean. In this instance, about 86.27% of the 1500 confidence intervals contain the population mean.'
In [178]:
from scipy import stats

stats.norm.interval(confidence = 0.95, loc = 2.591, scale = 3.684842/math.sqrt(30))
Out[178]:
(1.2724204546153968, 3.9095795453846036)
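One caveat on the cell above: with only 30 observations and an estimated standard deviation, the t distribution (df = n − 1) is the textbook choice and gives a slightly wider interval than the normal. A sketch using the trip-distance sample statistics printed earlier:

```python
import math
from scipy import stats

n, mean, sd = 30, 2.9165, 3.6540          # sample stats reported above
se = sd / math.sqrt(n)                    # standard error of the mean

z_int = stats.norm.interval(0.95, loc=mean, scale=se)
t_int = stats.t.interval(0.95, df=n - 1, loc=mean, scale=se)

print(z_int)
print(t_int)   # slightly wider: t critical value (~2.045 at df = 29) > 1.96
```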

Outliers¶

Below are two variables of interest which contain a huge number of outliers, including the dependent variable

Trip duration (in seconds) distribution with outliers

In [ ]:
sb.boxplot(data.duration_secs)
<AxesSubplot: >
In [ ]:
sb.boxplot(data.total_amount)
<AxesSubplot: >
In [13]:
# remove negative values

cols = data.select_dtypes(include = "number").columns

for col in cols:
    data.loc[data[col] < 0, col] = 0
In [121]:
from scipy.stats import zscore

# Ended up not using z-scores because they still left many outliers
# In hindsight this would have been quicker if I'd just written a function rather than doing it manually for each variable
# We're correcting total amount, fare amount, trip distance, and duration in seconds

data["total_amount_z"] = zscore(data.total_amount)
data["duration_secs_z"] = zscore(data.duration_secs)

# Put a cap on these values; the max value is now Q3 + 1.5*IQR
# Total amount outliers
total_amount_upper = math.ceil(data["total_amount"].quantile(0.75) +  1.5*(data["total_amount"].quantile(0.75) - 
                                                                           data["total_amount"].quantile(0.25)))
total_amount_lower = math.floor(data["total_amount"].quantile(0.25) -  1.5*(data["total_amount"].quantile(0.75) - 
                                                                           data["total_amount"].quantile(0.25)))
data.loc[data.total_amount < 0, "total_amount"] = 0
data.loc[data.total_amount > total_amount_upper, "total_amount"] = total_amount_upper

# Trip distance outliers
trip_distance_upper = math.ceil(data["trip_distance"].quantile(0.75) +  1.5*(data["trip_distance"].quantile(0.75) - 
                                                                           data["trip_distance"].quantile(0.25)))
trip_distance_lower = math.floor(data["trip_distance"].quantile(0.25) -  1.5*(data["trip_distance"].quantile(0.75) - 
                                                                           data["trip_distance"].quantile(0.25)))

data.loc[data.trip_distance < 0, "trip_distance"] = 0
data.loc[data.trip_distance > trip_distance_upper, "trip_distance"] = trip_distance_upper
In [62]:
# Fare amount outliers
fare_amount_upper = math.ceil(data["fare_amount"].quantile(0.75) +  1.5*(data["fare_amount"].quantile(0.75) - 
                                                                           data["fare_amount"].quantile(0.25)))
fare_amount_lower = math.floor(data["fare_amount"].quantile(0.25) -  1.5*(data["fare_amount"].quantile(0.75) - 
                                                                           data["fare_amount"].quantile(0.25)))
data.loc[data.fare_amount < 0, "fare_amount"] = 0
data.loc[data.fare_amount > fare_amount_upper, "fare_amount"] = fare_amount_upper

# Duration outliers
duration_secs_upper = math.ceil(data["duration_secs"].quantile(0.75) +  1.5*(data["duration_secs"].quantile(0.75) - 
                                                                           data["duration_secs"].quantile(0.25)))
duration_secs_lower = math.floor(data["duration_secs"].quantile(0.25) -  1.5*(data["duration_secs"].quantile(0.75) - 
                                                                           data["duration_secs"].quantile(0.25)))

data.loc[data.duration_secs < 0, "duration_secs"] = 0
data.loc[data.duration_secs > duration_secs_upper, "duration_secs"] = duration_secs_upper
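In the spirit of that earlier "should have written a function" comment, here's a hypothetical helper that bundles the same steps (floor at 0, cap at Q3 + 1.5·IQR, without the ceil/floor rounding used above), demonstrated on toy data:

```python
import pandas as pd

def cap_iqr(df: pd.DataFrame, col: str, k: float = 1.5) -> None:
    """Clip `col` in place to [0, Q3 + k*IQR]. Hypothetical helper mirroring the manual steps above."""
    q1, q3 = df[col].quantile(0.25), df[col].quantile(0.75)
    upper = q3 + k * (q3 - q1)
    df.loc[df[col] < 0, col] = 0
    df.loc[df[col] > upper, col] = upper

# usage sketch on toy data (Q1 = 6.5, Q3 = 14.5, so the cap is 26.5)
toy = pd.DataFrame({"fare_amount": [-5.0, 6.5, 9.5, 14.5, 999.99]})
cap_iqr(toy, "fare_amount")
print(toy.fare_amount.tolist())   # [0.0, 6.5, 9.5, 14.5, 26.5]
```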
In [ ]:
sb.boxplot(data.duration_secs)
<AxesSubplot: >
In [ ]:
sb.boxplot(data.total_amount)
<AxesSubplot: >

Hypothesis Testing¶

Student's t-test¶

Do the customers who use a credit card pay higher fare amounts than those who use cash?

That said, the TLC team is asking us to consider the following:

  • The relationship between fare amount and payment type.
  • Test the hypothesis that customers who use a credit card pay higher fare amounts.
  • Should you conclude that there is a statistically significant relationship between credit card payment and fare amount, discuss the next steps: what strategies could our team implement to encourage customers to pay with credit card?

EDA

In [6]:
data.fare_amount.describe()
Out[6]:
count    22699.000000
mean        13.026629
std         13.243791
min       -120.000000
25%          6.500000
50%          9.500000
75%         14.500000
max        999.990000
Name: fare_amount, dtype: float64
In [7]:
data.payment_type.value_counts()
Out[7]:
1    15265
2     7267
3      121
4       46
Name: payment_type, dtype: int64
In [17]:
sb.boxplot(data = data, y = "fare_amount", x = "payment_cats", showfliers = False)
Out[17]:
<AxesSubplot: xlabel='payment_cats', ylabel='fare_amount'>
In [24]:
data[["payment_cats", "fare_amount"]].groupby('payment_cats').describe()
Out[24]:
fare_amount
count mean std min 25% 50% 75% max
payment_cats
Cash 7267.0 12.213546 11.689940 0.0 6.0 9.0 14.000 450.00
Credit card 15265.0 13.429748 13.848964 0.0 7.0 9.5 15.000 999.99
Dispute 46.0 9.913043 24.162943 -120.0 5.0 8.5 17.625 52.00
No charge 121.0 12.186116 14.894232 -4.5 2.5 7.0 15.000 65.50
In [78]:
sb.scatterplot(x = data.trip_distance, y = data.duration_secs, hue = data.payment_cats)
Out[78]:
<AxesSubplot: xlabel='trip_distance', ylabel='duration_secs'>
In [87]:
#fig = plt.figure()
#ax = fig.add_subplot(projection ='3d')

fig = px.scatter_3d(data, x = 'trip_distance', y = 'duration_secs', z = 'total_amount', color = data.payment_cats)

fig.show()

Testing

Student's t-test for equality of means

  1. H0: There is no difference in the amounts paid by credit card and cash users

  2. H1: Customers who use a credit card pay higher fare amounts

  3. Significance level: 5%

In [17]:
cash = data[data.payment_cats == 'Cash'].sample(1000, replace = True, random_state = 28)

credit = data[data.payment_cats == 'Credit card'].sample(1000, replace = True, random_state = 28)
In [19]:
print(f"cash: {data[data.payment_cats == 'Cash'].fare_amount.mean()}\nCredit: {data[data.payment_cats == 'Credit card'].fare_amount.mean()}")
cash: 12.21354616760699
Credit: 13.429747789059942
In [18]:
print(f"cash: {cash.fare_amount.mean()}\nCredit: {credit.fare_amount.mean()}")
cash: 12.368
Credit: 12.9345
In [40]:
stats.ttest_ind(cash.fare_amount, credit.fare_amount, equal_var = False)
Out[40]:
Ttest_indResult(statistic=-1.242182632543156, pvalue=0.2143152228089723)
In [44]:
f'The p-value {stats.ttest_ind(cash.fare_amount, credit.fare_amount, equal_var = False).pvalue:.2%}'
Out[44]:
'The p-value 21.43%'

Conclusion: The p-value is greater than the significance level, hence we fail to reject the null hypothesis.

A difference of this size in the sample means is quite plausible under the null hypothesis, so there is no statistical evidence that the fares paid by the two groups differ.

There is no need to target marketing efforts towards users of any particular payment type.
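Note that H1 is directional (credit card fares are higher) while `stats.ttest_ind` defaults to a two-sided test. SciPy 1.6+ supports a one-sided `alternative` argument; a minimal sketch on synthetic data (the means and spreads below are illustrative, loosely matching the sample statistics, not drawn from the dataset):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(28)
cash_fares = rng.normal(loc=12.2, scale=11.7, size=1000)    # illustrative values
credit_fares = rng.normal(loc=13.4, scale=13.8, size=1000)

# H1: cash mean < credit mean, i.e. credit card fares are higher
result = stats.ttest_ind(cash_fares, credit_fares, equal_var=False, alternative="less")
print(f"t = {result.statistic:.3f}, one-sided p = {result.pvalue:.4f}")
```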

Anova¶

Fare_amount vs Payment_type¶

H0: There is no difference in the amounts paid using any of the payment types

H1: At least one of the payment types has fare amounts different from the others

In [6]:
model_anova = ols(data = data, formula = "fare_amount ~ C(payment_type)").fit()

sm.stats.anova_lm(model_anova, typ = 2)
Out[6]:
sum_sq df F PR(>F)
C(payment_type) 7.816301e+03 3.0 14.881664 1.124138e-09
Residual 3.973367e+06 22695.0 NaN NaN

The p-value is far below the 5% significance level: the probability of seeing an F-statistic as or more extreme than 14.88 under the null hypothesis is essentially zero. We reject the null hypothesis and conclude that the average fare amount differs for at least one of the payment methods.

Tukey HSD Test¶

This is to ascertain which pairs of payment types have significantly different means.

H0: Mean fares paid by the customers is the same for each pair of payment type

H1: The mean amount paid by the customer is different for at least one pair of payment types

In [12]:
tukey = pairwise_tukeyhsd(endog = data.fare_amount, groups = data.payment_cats, alpha = 0.05)
In [14]:
tukey.summary()
Out[14]:
Multiple Comparison of Means - Tukey HSD, FWER=0.01
group1 group2 meandiff p-adj lower upper reject
Cash Credit card 1.2162 0.0 0.629 1.8034 True
Cash Dispute -2.3005 0.6424 -8.394 3.793 False
Cash No charge -0.0274 1.0 -3.8038 3.7489 False
Credit card Dispute -3.5167 0.2734 -9.6002 2.5668 False
Credit card No charge -1.2436 0.7319 -5.0037 2.5165 False
Dispute No charge 2.2731 0.7541 -4.8631 9.4092 False

Tukey's test finds no difference in the pairwise means for any pair of payment types except cash and credit card, in stark contrast to the result of the Student's t-test. The discrepancy is due solely to the t-test using a small random sub-sample while the ANOVA and Tukey tests used the entire dataset. Repeating the t-test with a much larger sub-sample confirms the latter two results.

In [30]:
stats.ttest_ind(data[data.payment_cats == "Cash"].fare_amount.sample(7000, replace = True, random_state = 28), data[data.payment_cats == "Credit card"].fare_amount.sample(7000, replace = True, random_state = 28), equal_var = False)
Out[30]:
Ttest_indResult(statistic=-4.485339416872243, pvalue=7.34308239466814e-06)

Trip Duration vs Time periods¶

In [44]:
data.columns
Out[44]:
Index(['Unnamed: 0', 'VendorID', 'tpep_pickup_datetime',
       'tpep_dropoff_datetime', 'passenger_count', 'trip_distance',
       'RatecodeID', 'store_and_fwd_flag', 'PULocationID', 'DOLocationID',
       'payment_type', 'fare_amount', 'extra', 'mta_tax', 'tip_amount',
       'tolls_amount', 'improvement_surcharge', 'total_amount', 'payment_cats',
       'duration_secs', 'duration_mins', 'week', 'day', 'month',
       'tpep_pickup_time', 'month_num', 'day_num', 'hour', 'period_of_day',
       'time_of_day', 'time', 'minutes'],
      dtype='object')
In [45]:
model_hour = ols(data = data, formula = "duration_secs ~ C(hour)").fit()

sm.stats.anova_lm(model_hour, typ = 2)
Out[45]:
sum_sq df F PR(>F)
C(hour) 2.714466e+08 23.0 0.852816 0.664989
Residual 3.137969e+11 22675.0 NaN NaN

The p-value (0.665) is well above the 5% significance level, so we fail to reject the null hypothesis: average trip duration does not differ significantly across the hours of the day.

Chi Squared¶

Goodness of fit:

H0: Trip durations are the same for each month, day, and time on average.

H1: Trip durations are not the same across these periods

In [17]:
# Duration of trips are the same for each month. 
# Calculating the expected monthly durations

month_gof = data[["month", "duration_secs"]].groupby("month").mean().reset_index().copy()

month_gof["duration_exp"] = data.duration_secs.mean()

month_gof.style.background_gradient(cmap="Reds")
Out[17]:
  month duration_secs duration_exp
0 Jan 825.831247 1020.826600
1 Feb 904.775014 1020.826600
2 Mar 983.993655 1020.826600
3 Apr 1128.545319 1020.826600
4 May 1037.839046 1020.826600
5 Jun 1157.679735 1020.826600
6 Jul 1081.131408 1020.826600
7 Aug 897.383991 1020.826600
8 Sep 1013.831027 1020.826600
9 Oct 922.344351 1020.826600
10 Nov 998.272382 1020.826600
11 Dec 1296.436393 1020.826600
In [ ]:
# scipy's chisquare requires the observed and expected sums to agree,
# so rescale the expected values before testing
expected = month_gof.duration_exp * month_gof.duration_secs.sum() / month_gof.duration_exp.sum()

month_chi = stats.chisquare(month_gof.duration_secs, expected, axis=0)

month_chi.pvalue  # pvalue is an attribute on the result, not a method
In [15]:
data[["day", "duration_secs"]].groupby("day").mean().reset_index().style.background_gradient(cmap="Reds")
Out[15]:
  day duration_secs
0 Mon 925.504606
1 Tue 916.036898
2 Wed 1090.023599
3 Thu 1113.103762
4 Fri 1072.891591
5 Sat 975.732997
6 Sun 1034.213476
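The same goodness-of-fit heuristic can be applied to the day-of-week means above. A minimal sketch using the printed means as observed values (note `stats.chisquare` strictly expects frequencies, so treating mean durations this way is a rough heuristic, as in the monthly cell):

```python
from scipy import stats

day_means = [925.504606, 916.036898, 1090.023599, 1113.103762,
             1072.891591, 975.732997, 1034.213476]  # Mon..Sun, from the table above
# Expected value: the grand mean of the daily means, so observed and expected sums match
expected = [sum(day_means) / len(day_means)] * len(day_means)

day_chi = stats.chisquare(day_means, expected)
print(f"chi2 = {day_chi.statistic:.2f}, p = {day_chi.pvalue:.4f}")
```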

Linear Regression¶

In [105]:
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

# Create a copy of the original dataset, standardize predictors, 
# split X and Y variables, then split into training and testing sets
# We're skipping validation and cross-validation for this model

df = data.copy()

Y = df[["duration_secs"]]
X = df[['trip_distance','RatecodeID','payment_type', 'extra', 'mta_tax', 'tip_amount','tolls_amount', 'improvement_surcharge']]

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

X_train, X_test, Y_train, Y_test = train_test_split(X_scaled, Y,  test_size=0.2, random_state = 28)

X_train = pd.DataFrame(X_train, columns = X.columns)
X_test = pd.DataFrame(X_test, columns = X.columns)

X_train
Out[105]:
trip_distance RatecodeID payment_type extra mta_tax tip_amount tolls_amount improvement_surcharge
0 -0.647159 -0.139036 -0.677904 0.360332 0.064119 -0.299011 -0.223514 0.028072
1 -0.896884 -0.139036 1.339082 -0.719283 0.064119 -0.656009 -0.223514 0.028072
2 0.126990 -0.139036 -0.677904 -0.719283 0.064119 0.754133 -0.223514 0.028072
3 0.052072 -0.139036 -0.677904 0.360332 0.064119 0.432835 -0.223514 0.028072
4 2.299600 -0.139036 -0.677904 0.360332 0.064119 1.739447 3.890853 0.028072
... ... ... ... ... ... ... ... ...
18132 -0.986786 -0.139036 -0.677904 0.360332 0.064119 -0.141932 -0.223514 0.028072
18133 1.850095 -0.139036 -0.677904 1.439947 0.064119 1.039731 -0.223514 0.028072
18134 -0.966808 -0.139036 1.339082 -0.719283 0.064119 -0.656009 -0.223514 0.028072
18135 -0.831956 -0.139036 1.339082 -0.719283 0.064119 -0.656009 -0.223514 0.028072
18136 -0.727071 -0.139036 1.339082 -0.719283 0.064119 -0.656009 -0.223514 0.028072

18137 rows × 8 columns
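Although validation is skipped here, a quick k-fold cross-validation would give a more robust estimate of out-of-sample R². A minimal sketch on synthetic data standing in for the scaled training matrix (the demo arrays are illustrative, not the notebook's `X_train`/`Y_train`):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(28)
X_demo = rng.normal(size=(500, 8))                              # stand-in for 8 scaled predictors
y_demo = X_demo @ rng.normal(size=8) + rng.normal(scale=0.5, size=500)

# 5-fold CV: five fits, each scored on a held-out fifth of the data
scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=5, scoring="r2")
print(f"mean CV R^2: {scores.mean():.3f} (+/- {scores.std():.3f})")
```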

In [103]:
# Build and fit model to the training data

model_lr = LinearRegression()
model_lr.fit(X_train,Y_train)
Out[103]:
LinearRegression()

Model eval¶

In [104]:
r_sq = model_lr.score(X_train, Y_train)
print(f"Coefficient of determination: {r_sq:.2%}")
Y_pred = model_lr.predict(X_train)
print(f"R^2: {r2_score(Y_train, Y_pred):.2%}")
print(f"MAE: {mean_absolute_error(Y_train,Y_pred)}")
print(f"RMSE:{np.sqrt(mean_squared_error(Y_train, Y_pred))}")
Coefficient of determination: 51.19%
R^2: 51.19%
MAE: 0.39724078250029016
RMSE:0.5595480873756041
In [82]:
r_sq_test = model_lr.score(X_test, Y_test)
print("Coefficient of determination:", r_sq_test)
Y_pred_test = model_lr.predict(X_test)
print("R^2:", r2_score(Y_test, Y_pred_test))
print("MAE:", mean_absolute_error(Y_test,Y_pred_test))
print("RMSE:",np.sqrt(mean_squared_error(Y_test, Y_pred_test)))
Coefficient of determination: 0.6577045130990453
R^2: 0.6577045130990453
MAE: 239.21455013245085
RMSE: 323.27013412163654
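With 8 predictors, the adjusted R² is a fairer figure to compare across models of different sizes. A quick sketch using the test-set R² printed above and the test-set size of 4540 rows reported later by the dashboard:

```python
# Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
r2 = 0.6577045130990453   # test-set R^2 from the cell above
n = 4540                  # test rows (the dashboard later reports 4540 idxs)
p = 8                     # number of predictors

adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(f"Adjusted R^2: {adj_r2:.4f}")
```

With n much larger than p, the adjustment barely moves the value, which suggests the model is not over-parameterized.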
In [ ]:
'''import statsmodels

formula = "data.duration_secs ~ 'data.trip_distance'+ 'data.RatecodeID'+'data.payment_type'+'data.extra'+'data.mta_tax'+ 'data.tip_amount'+'data.tolls_amount'+ 'data.improvement_surcharge'"

model_ols = ols(data = data, formula = formula)

model_ols.fit().summary()

model_res = model_ols.fit_regularized(method = "elastic_net")

pinv_wexog, _ = pinv_extended(model_ols.wexog)
normalized_cov_params = np.dot(pinv_wexog, np.transpose(pinv_wexog))
summary = statsmodels.regression.linear_model.OLSResults(model_ols, model_res.params, normalized_cov_params)
summary.summary()'''
In [91]:
results = pd.DataFrame(data={"actual": Y_test["duration_secs"],
                             "predicted": Y_pred_test.ravel()})
results["residual"] = results["actual"] - results["predicted"]
results.head()
Out[91]:
actual predicted residual
146 520.0 598.243338 -78.243338
19350 283.0 431.842645 -148.842645
20011 515.0 550.199015 -35.199015
13009 1620.0 1470.542291 149.457709
3822 674.0 506.622903 167.377097
In [109]:
sb.set(style='whitegrid')
plt.figure(figsize = (12,12))
sb.regplot(x="actual", y="predicted", data = results)
#plt.show()
Out[109]:
<AxesSubplot: xlabel='actual', ylabel='predicted'>
In [108]:
plt.hist(results["residual"], bins=30)
plt.title("Distribution of the residuals")
plt.xlabel("residual value")
plt.ylabel("count")

# We're happy our errors are normally distributed
Out[108]:
Text(0, 0.5, 'count')
In [110]:
sb.scatterplot(x = "predicted", y = "residual", data = results)
plt.axhline(0)
plt.title("Scatterplot of residuals over predicted values")
plt.xlabel("predicted value")
plt.ylabel("residual value")

# residuals form an unstructured cloud around zero, which is exactly what we want

Model Dashboard¶

In [83]:
import explainerdashboard
from explainerdashboard import RegressionExplainer, ClassifierExplainer, ExplainerDashboard

feature_descriptions = {'trip_distance': "Distance of the trip in miles",
                        'RatecodeID': "Category of the tariff being used and charged",
                        'payment_type': "The payment method used by the passenger",
                        'extra': "Extras",
                        'mta_tax': "Tax",
                        'tip_amount': "Tip paid by passenger",
                        'tolls_amount': "Amount paid at toll",
                        'improvement_surcharge': "Fixed improvement surcharge ($0.30 per trip)"}


explainer = RegressionExplainer(model_lr, X_test, Y_test, units = "s", descriptions = feature_descriptions
                                )
dashboard = ExplainerDashboard(explainer, title = "Taxi Trip Duration Estimator")
dashboard.run()
WARNING: For shap='linear', shap interaction values can unfortunately not be calculated!
Warning: shap values for shap.LinearExplainer get calculated against X_background, but paramater X_background=None, so using X instead
Generating self.shap_explainer = shap.LinearExplainer(modelX)...
Building ExplainerDashboard..
WARNING: the number of idxs (=4540) > max_idxs_in_dropdown(=1000). However with your installed version of dash(2.9.3) dropdown search may not work smoothly. You can downgrade to `pip install dash==2.6.2` which should work better for now...
Detected notebook environment, consider setting mode='external', mode='inline' or mode='jupyterlab' to keep the notebook interactive while the dashboard is running...
For this type of model and model_output interactions don't work, so setting shap_interaction=False...
The explainer object has no decision_trees property. so setting decision_trees=False...
Generating layout...
Calculating shap values...
Calculating predictions...
Calculating residuals...
Calculating absolute residuals...
Warning: mean-absolute-percentage-error is very large (2101690679724542.8), you can hide it from the metrics by passing parameter show_metrics...
Calculating dependencies...
Calculating importances...
Reminder: you can store the explainer (including calculated dependencies) with explainer.dump('explainer.joblib') and reload with e.g. ClassifierExplainer.from_file('explainer.joblib')
Registering callbacks...
Starting ExplainerDashboard on http://10.0.0.107:8050
Dash is running on http://0.0.0.0:8050/

 * Serving Flask app 'explainerdashboard.dashboards'
 * Debug mode: off
WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.
 * Running on all addresses (0.0.0.0)
 * Running on http://127.0.0.1:8050
 * Running on http://10.0.0.107:8050
Press CTRL+C to quit
10.0.0.107 - - [13/May/2023 20:52:32] "POST /_dash-update-component HTTP/1.1" 200 -
10.0.0.107 - - [13/May/2023 20:52:32] "POST /_dash-update-component HTTP/1.1" 200 -
10.0.0.107 - - [13/May/2023 20:52:43] "POST /_dash-update-component HTTP/1.1" 200 -
10.0.0.107 - - [13/May/2023 20:52:52] "POST /_dash-update-component HTTP/1.1" 200 -
10.0.0.107 - - [13/May/2023 20:53:25] "POST /_dash-update-component HTTP/1.1" 200 -
10.0.0.107 - - [13/May/2023 20:53:28] "POST /_dash-update-component HTTP/1.1" 200 -
10.0.0.107 - - [13/May/2023 20:53:35] "POST /_dash-update-component HTTP/1.1" 200 -
10.0.0.107 - - [13/May/2023 20:53:38] "POST /_dash-update-component HTTP/1.1" 200 -
10.0.0.107 - - [13/May/2023 20:53:55] "POST /_dash-update-component HTTP/1.1" 200 -
10.0.0.107 - - [13/May/2023 20:54:11] "POST /_dash-update-component HTTP/1.1" 200 -
10.0.0.107 - - [13/May/2023 20:54:18] "POST /_dash-update-component HTTP/1.1" 200 -
10.0.0.107 - - [13/May/2023 20:54:20] "POST /_dash-update-component HTTP/1.1" 200 -
10.0.0.107 - - [13/May/2023 20:54:22] "POST /_dash-update-component HTTP/1.1" 200 -
10.0.0.107 - - [13/May/2023 20:54:53] "POST /_dash-update-component HTTP/1.1" 200 -
10.0.0.107 - - [13/May/2023 20:54:58] "POST /_dash-update-component HTTP/1.1" 200 -
10.0.0.107 - - [13/May/2023 20:55:02] "POST /_dash-update-component HTTP/1.1" 200 -
10.0.0.107 - - [13/May/2023 20:55:05] "POST /_dash-update-component HTTP/1.1" 200 -
10.0.0.107 - - [13/May/2023 20:55:11] "POST /_dash-update-component HTTP/1.1" 200 -
10.0.0.107 - - [13/May/2023 20:55:24] "POST /_dash-update-component HTTP/1.1" 200 -
10.0.0.107 - - [13/May/2023 20:55:33] "POST /_dash-update-component HTTP/1.1" 200 -
10.0.0.107 - - [13/May/2023 20:55:38] "POST /_dash-update-component HTTP/1.1" 200 -
10.0.0.107 - - [13/May/2023 20:55:39] "POST /_dash-update-component HTTP/1.1" 200 -
10.0.0.107 - - [13/May/2023 20:55:39] "POST /_dash-update-component HTTP/1.1" 200 -
10.0.0.107 - - [13/May/2023 20:55:39] "POST /_dash-update-component HTTP/1.1" 200 -
10.0.0.107 - - [13/May/2023 20:55:44] "POST /_dash-update-component HTTP/1.1" 200 -
10.0.0.107 - - [13/May/2023 20:55:45] "POST /_dash-update-component HTTP/1.1" 200 -
10.0.0.107 - - [13/May/2023 20:55:49] "POST /_dash-update-component HTTP/1.1" 200 -
10.0.0.107 - - [13/May/2023 20:55:51] "POST /_dash-update-component HTTP/1.1" 200 -
Warning: mean-absolute-percentage-error is very large (2101690679724542.8), you can hide it from the metrics by passing parameter show_metrics...
[2023-05-13 20:56:40,133] ERROR in app: Exception on /_dash-update-component [POST]
Traceback (most recent call last):
  File "c:\Users\Jason\anaconda3\lib\site-packages\flask\app.py", line 2525, in wsgi_app
    response = self.full_dispatch_request()
  File "c:\Users\Jason\anaconda3\lib\site-packages\flask\app.py", line 1822, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "c:\Users\Jason\anaconda3\lib\site-packages\flask\app.py", line 1820, in full_dispatch_request
    rv = self.dispatch_request()
  File "c:\Users\Jason\anaconda3\lib\site-packages\flask\app.py", line 1796, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)
  File "c:\Users\Jason\anaconda3\lib\site-packages\dash\dash.py", line 1283, in dispatch
    ctx.run(
  File "c:\Users\Jason\anaconda3\lib\site-packages\dash\_callback.py", line 450, in add_context
    output_value = func(*func_args, **func_kwargs)  # %% callback invoked %%
  File "C:\Users\Jason\AppData\Roaming\Python\Python310\site-packages\explainerdashboard\dashboards.py", line 283, in download_html
    return dict(content=self.to_html(state_dict), filename="dashboard.html")
  File "C:\Users\Jason\AppData\Roaming\Python\Python310\site-packages\explainerdashboard\dashboards.py", line 238, in to_html
    tabs = {
  File "C:\Users\Jason\AppData\Roaming\Python\Python310\site-packages\explainerdashboard\dashboards.py", line 239, in <dictcomp>
    tab.title: tab.to_html(state_dict, add_header=False) for tab in self.tabs
  File "C:\Users\Jason\AppData\Roaming\Python\Python310\site-packages\explainerdashboard\dashboard_components\composites.py", line 910, in to_html
    self.shap_summary.to_html(state_dict, add_header=False),
  File "C:\Users\Jason\AppData\Roaming\Python\Python310\site-packages\explainerdashboard\dashboard_components\shap_components.py", line 280, in to_html
    fig = self.explainer.plot_importances_detailed(
  File "C:\Users\Jason\AppData\Roaming\Python\Python310\site-packages\explainerdashboard\explainers.py", line 66, in inner
    return func(self, *args, **kwargs)
  File "C:\Users\Jason\AppData\Roaming\Python\Python310\site-packages\explainerdashboard\explainers.py", line 1901, in plot_importances_detailed
    cols = self.get_importances_df(kind="shap", topx=topx, pos_label=pos_label)[
  File "C:\Users\Jason\AppData\Roaming\Python\Python310\site-packages\explainerdashboard\explainers.py", line 66, in inner
    return func(self, *args, **kwargs)
  File "C:\Users\Jason\AppData\Roaming\Python\Python310\site-packages\explainerdashboard\explainers.py", line 1558, in get_importances_df
    return self.get_mean_abs_shap_df(topx, cutoff, pos_label)
  File "C:\Users\Jason\AppData\Roaming\Python\Python310\site-packages\explainerdashboard\explainers.py", line 66, in inner
    return func(self, *args, **kwargs)
  File "C:\Users\Jason\AppData\Roaming\Python\Python310\site-packages\explainerdashboard\explainers.py", line 1335, in get_mean_abs_shap_df
    return shap_df[shap_df["MEAN_ABS_SHAP"] >= cutoff].head(topx)
  File "c:\Users\Jason\anaconda3\lib\site-packages\pandas\core\generic.py", line 5547, in head
    return self.iloc[:n]
  File "c:\Users\Jason\anaconda3\lib\site-packages\pandas\core\indexing.py", line 1073, in __getitem__
    return self._getitem_axis(maybe_callable, axis=axis)
  File "c:\Users\Jason\anaconda3\lib\site-packages\pandas\core\indexing.py", line 1602, in _getitem_axis
    return self._get_slice_axis(key, axis=axis)
  File "c:\Users\Jason\anaconda3\lib\site-packages\pandas\core\indexing.py", line 1637, in _get_slice_axis
    labels._validate_positional_slice(slice_obj)
  File "c:\Users\Jason\anaconda3\lib\site-packages\pandas\core\indexes\base.py", line 4212, in _validate_positional_slice
    self._validate_indexer("positional", key.stop, "iloc")
  File "c:\Users\Jason\anaconda3\lib\site-packages\pandas\core\indexes\base.py", line 6591, in _validate_indexer
    raise self._invalid_indexer(form, key)
TypeError: cannot do positional indexing on Int64Index with these indexers [8] of type str
10.0.0.107 - - [13/May/2023 20:56:40] "POST /_dash-update-component HTTP/1.1" 500 -
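The `TypeError` above comes from pandas rejecting a string in a positional slice: `DataFrame.head(n)` delegates to `self.iloc[:n]`, and the traceback shows the dashboard's `topx` value arriving as the string `"8"` instead of an integer. A minimal sketch of the failure and a workaround (the `topx` name and the `int` cast are assumptions read off the traceback, not explainerdashboard's actual fix):

```python
import pandas as pd

shap_df = pd.DataFrame({"MEAN_ABS_SHAP": [0.5, 0.3, 0.1]})

# .head() slices positionally via .iloc, which rejects string indexers,
# reproducing the "cannot do positional indexing ... of type str" error.
try:
    shap_df.head("8")
except TypeError as err:
    print(type(err).__name__)  # TypeError

# Casting the incoming value to int before calling .head() avoids it;
# asking for more rows than exist simply returns the whole frame.
topx = int("8")  # hypothetical: the value the callback receives as a string
print(len(shap_df.head(topx)))  # 3
```

So the 500s are triggered only by the "download HTML" callback passing the dropdown value through unconverted; the rest of the dashboard keeps serving 200s.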
10.0.0.107 - - [13/May/2023 20:57:14] "POST /_dash-update-component HTTP/1.1" 200 -
10.0.0.107 - - [13/May/2023 20:58:37] "GET / HTTP/1.1" 200 -
[... layout, dependency, and cached asset requests followed by repeated "POST /_dash-update-component" callbacks (200/204) from 10.0.0.107 between 20:58:38 and 20:59:17 omitted ...]
Warning: mean-absolute-percentage-error is very large (2101690679724542.8), you can hide it from the metrics by passing parameter show_metrics...
[2023-05-13 20:59:26,650] ERROR in app: Exception on /_dash-update-component [POST]
Traceback (most recent call last):
  File "c:\Users\Jason\anaconda3\lib\site-packages\flask\app.py", line 2525, in wsgi_app
    response = self.full_dispatch_request()
  File "c:\Users\Jason\anaconda3\lib\site-packages\flask\app.py", line 1822, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "c:\Users\Jason\anaconda3\lib\site-packages\flask\app.py", line 1820, in full_dispatch_request
    rv = self.dispatch_request()
  File "c:\Users\Jason\anaconda3\lib\site-packages\flask\app.py", line 1796, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)
  File "c:\Users\Jason\anaconda3\lib\site-packages\dash\dash.py", line 1283, in dispatch
    ctx.run(
  File "c:\Users\Jason\anaconda3\lib\site-packages\dash\_callback.py", line 450, in add_context
    output_value = func(*func_args, **func_kwargs)  # %% callback invoked %%
  File "C:\Users\Jason\AppData\Roaming\Python\Python310\site-packages\explainerdashboard\dashboards.py", line 283, in download_html
    return dict(content=self.to_html(state_dict), filename="dashboard.html")
  File "C:\Users\Jason\AppData\Roaming\Python\Python310\site-packages\explainerdashboard\dashboards.py", line 238, in to_html
    tabs = {
  File "C:\Users\Jason\AppData\Roaming\Python\Python310\site-packages\explainerdashboard\dashboards.py", line 239, in <dictcomp>
    tab.title: tab.to_html(state_dict, add_header=False) for tab in self.tabs
  File "C:\Users\Jason\AppData\Roaming\Python\Python310\site-packages\explainerdashboard\dashboard_components\composites.py", line 910, in to_html
    self.shap_summary.to_html(state_dict, add_header=False),
  File "C:\Users\Jason\AppData\Roaming\Python\Python310\site-packages\explainerdashboard\dashboard_components\shap_components.py", line 276, in to_html
    fig = self.explainer.plot_importances(
  File "C:\Users\Jason\AppData\Roaming\Python\Python310\site-packages\explainerdashboard\explainers.py", line 66, in inner
    return func(self, *args, **kwargs)
  File "C:\Users\Jason\AppData\Roaming\Python\Python310\site-packages\explainerdashboard\explainers.py", line 1830, in plot_importances
    importances_df = self.get_importances_df(
  File "C:\Users\Jason\AppData\Roaming\Python\Python310\site-packages\explainerdashboard\explainers.py", line 66, in inner
    return func(self, *args, **kwargs)
  File "C:\Users\Jason\AppData\Roaming\Python\Python310\site-packages\explainerdashboard\explainers.py", line 1558, in get_importances_df
    return self.get_mean_abs_shap_df(topx, cutoff, pos_label)
  File "C:\Users\Jason\AppData\Roaming\Python\Python310\site-packages\explainerdashboard\explainers.py", line 66, in inner
    return func(self, *args, **kwargs)
  File "C:\Users\Jason\AppData\Roaming\Python\Python310\site-packages\explainerdashboard\explainers.py", line 1335, in get_mean_abs_shap_df
    return shap_df[shap_df["MEAN_ABS_SHAP"] >= cutoff].head(topx)
  File "c:\Users\Jason\anaconda3\lib\site-packages\pandas\core\generic.py", line 5547, in head
    return self.iloc[:n]
  File "c:\Users\Jason\anaconda3\lib\site-packages\pandas\core\indexing.py", line 1073, in __getitem__
    return self._getitem_axis(maybe_callable, axis=axis)
  File "c:\Users\Jason\anaconda3\lib\site-packages\pandas\core\indexing.py", line 1602, in _getitem_axis
    return self._get_slice_axis(key, axis=axis)
  File "c:\Users\Jason\anaconda3\lib\site-packages\pandas\core\indexing.py", line 1637, in _get_slice_axis
    labels._validate_positional_slice(slice_obj)
  File "c:\Users\Jason\anaconda3\lib\site-packages\pandas\core\indexes\base.py", line 4212, in _validate_positional_slice
    self._validate_indexer("positional", key.stop, "iloc")
  File "c:\Users\Jason\anaconda3\lib\site-packages\pandas\core\indexes\base.py", line 6591, in _validate_indexer
    raise self._invalid_indexer(form, key)
TypeError: cannot do positional indexing on Int64Index with these indexers [8] of type str
10.0.0.107 - - [13/May/2023 20:59:26] "POST /_dash-update-component HTTP/1.1" 500 -
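The repeated MAPE warning is expected behaviour rather than a bug: mean absolute percentage error divides each absolute error by the true value, so any target at or near zero inflates the metric without bound. A sketch with synthetic numbers (assuming scikit-learn's `mean_absolute_percentage_error`; the dashboard's metric behaves the same way in spirit):

```python
import numpy as np
from sklearn.metrics import mean_absolute_percentage_error

y_true = np.array([1e-7, 10.0, 20.0])  # one near-zero target
y_pred = np.array([1.0, 10.0, 20.0])

# The first term alone is |1 - 1e-7| / 1e-7, about 1e7,
# which dwarfs the two perfect predictions and dominates the mean.
print(mean_absolute_percentage_error(y_true, y_pred))
```

Hiding the metric (as the warning suggests) or switching to MAE/RMSE is the usual response when the target can legitimately be zero.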
Warning: mean-absolute-percentage-error is very large (2101690679724542.8), you can hide it from the metrics by passing parameter show_metrics...

Machine Learning¶

The client wants to know whether we can build a model that predicts gratuity at the end of a trip, to help their taxi drivers earn more. The model would be deployed in the drivers' app alongside passenger profiles.

  • The initial idea was to build a regressor, but to allay ethical concerns the model will not predict the amount of gratuity; a classifier will be built instead.
  • This avoids passengers being stranded because the model predicts they won't tip (Uber could learn a thing or two here...).
  • It also avoids angering drivers who were promised tips by the model but received nothing from passengers.

The model will predict whether the passenger will tip at least 20% of the fare; this will be a binary variable.

EDA¶

In [111]:
data.columns
Out[111]:
Index(['Unnamed: 0', 'VendorID', 'tpep_pickup_datetime',
       'tpep_dropoff_datetime', 'passenger_count', 'trip_distance',
       'RatecodeID', 'store_and_fwd_flag', 'PULocationID', 'DOLocationID',
       'payment_type', 'fare_amount', 'extra', 'mta_tax', 'tip_amount',
       'tolls_amount', 'improvement_surcharge', 'total_amount',
       'duration_secs', 'duration_mins', 'week', 'day', 'month',
       'tpep_pickup_time', 'month_num', 'day_num', 'hour', 'period_of_day',
       'time_of_day', 'payment_cats', 'minutes', 'time', 'total_amount_z',
       'duration_secs_z', 'duration_secs_log', 'duration_secs_log_z'],
      dtype='object')

Exploring tip variable

In [13]:
data.tip_amount.describe()
Out[13]:
count   22699.00
mean        1.84
std         2.80
min         0.00
25%         0.00
50%         1.35
75%         2.45
max       200.00
Name: tip_amount, dtype: float64
In [16]:
sb.boxplot(data.tip_amount, showfliers = False)
Out[16]:
<AxesSubplot: >
In [17]:
# %matplotlib inline

sb.histplot(data.tip_amount)
Out[17]:
<AxesSubplot: xlabel='tip_amount', ylabel='Count'>
In [19]:
sb.histplot(data.tip_amount)
plt.xlim(left = 0, right = 6.5)
Out[19]:
(0.0, 6.5)

Tip as a percentage of the fare

In [14]:
data["tip_percent"] = 100 * data.tip_amount/data.fare_amount
In [15]:
data.tip_percent.describe()
Out[15]:
count    22679.000000
mean        14.405414
std         13.991558
min          0.000000
25%          0.000000
50%         18.285714
75%         22.909091
max        560.000000
Name: tip_percent, dtype: float64
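Note the count drops from 22,699 to 22,679: trips with a $0 fare turn the division into NaN (0/0) or inf (tip/0). A hedged sketch of a guarded version, on toy values rather than the real data:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"fare_amount": [10.0, 0.0, 8.0],
                    "tip_amount":  [2.0, 1.0, 0.0]})

# Divide only where the fare is positive; zero-fare trips become NaN explicitly.
toy["tip_percent"] = np.where(toy.fare_amount > 0,
                              100 * toy.tip_amount / toy.fare_amount,
                              np.nan)
print(toy.tip_percent.tolist())  # → [20.0, nan, 0.0]
```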
In [16]:
sb.boxplot(data.tip_percent, showfliers = False)
Out[16]:
<AxesSubplot: >
In [33]:
sb.histplot(data.tip_percent, kde = True, stat = "percent")
plt.xlim(left = -1, right = 60)

# about 36% of riders don't tip!!
Out[33]:
(-1.0, 60.0)
In [17]:
data[(data.tip_percent > 0) & (data.tip_percent < 60)].tip_percent.describe()

# Tip percentages without the outliers and the zeros
Out[17]:
count    14591.000000
mean        21.924229
std          6.603428
min          0.029412
25%         20.000000
50%         22.133333
75%         24.727273
max         58.823529
Name: tip_percent, dtype: float64
In [18]:
sb.boxplot(data[(data.tip_percent > 0) & (data.tip_percent < 50)].tip_percent)
Out[18]:
<AxesSubplot: >
In [19]:
data["good_tip"] = np.select([data.tip_percent < 20, data.tip_percent >= 20], [0, 1])

data.good_tip.value_counts()
Out[19]:
0    11581
1    11118
Name: good_tip, dtype: int64
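One subtlety with `np.select` here: rows where `tip_percent` is NaN (the zero-fare trips) match neither condition and fall through to the default of 0, so they are silently labeled as not-good tips. A small sketch of that behavior:

```python
import numpy as np
import pandas as pd

tip_percent = pd.Series([25.0, 10.0, np.nan])

# NaN satisfies neither condition, so it receives the default (0).
good_tip = np.select([tip_percent < 20, tip_percent >= 20], [0, 1], default=0)
print(good_tip.tolist())  # → [1, 0, 0]
```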
In [60]:
data[["good_tip", "tip_percent"]]
Out[60]:
good_tip tip_percent
0 1 21.23
1 1 25.00
2 1 22.31
3 1 31.17
4 0 0.00
... ... ...
22694 0 0.00
22695 1 28.15
22696 0 0.00
22697 0 16.19
22698 1 21.36

22699 rows × 2 columns

Feature Selection¶

Tips per day

In [39]:
data[["month", "day", "tip_amount"]].groupby("day").mean(numeric_only = True).style.background_gradient()
Out[39]:
  tip_amount
day  
Mon 1.935827
Tue 1.865009
Wed 1.916667
Thu 1.900791
Fri 1.842406
Sat 1.638966
Sun 1.755060

Tips per month

In [41]:
data[["month", "day", "tip_amount"]].groupby("month").mean(numeric_only = True).style.background_gradient()
Out[41]:
  tip_amount
month  
Jan 1.790110
Feb 1.934585
Mar 1.810137
Apr 1.780248
May 1.943835
Jun 1.839027
Jul 1.656730
Aug 1.714292
Sep 1.806943
Oct 1.848757
Nov 1.979029
Dec 1.905668

Payment categories

In [47]:
data[["payment_cats", "tip_amount"]].groupby("payment_cats").count()
# granted this data was self-reported... 
# tips are only recorded for credit card payments, so we'll restrict to those and exclude the other payment types
Out[47]:
tip_amount
payment_cats
Cash 7267
Credit card 15265
Dispute 46
No charge 121

Passenger Count

In [50]:
data.passenger_count.value_counts()

# I just noticed that there are rides with ghost passengers, 33 to be specific
Out[50]:
1    16117
2     3305
5     1143
3      953
6      693
4      455
0       33
Name: passenger_count, dtype: int64
In [49]:
sb.barplot(data = data[["passenger_count", "tip_amount"]].groupby("passenger_count").mean().reset_index(), x = "passenger_count", y = "tip_amount")
Out[49]:
<AxesSubplot: xlabel='passenger_count', ylabel='tip_amount'>

Correlation matrix

In [61]:
cols = ["tip_amount", "payment_type", "trip_distance", "duration_secs", "fare_amount", "tolls_amount", "passenger_count"]
In [62]:
sb.set(style='white', font_scale=1.6)
np.seterr(invalid = 'ignore')
g = sb.PairGrid(data[cols], aspect=1.5, diag_sharey=False, despine=False)
g.map_lower(sb.regplot, lowess = True, ci=False,
            line_kws={'color': 'red', 'lw': 1},
            scatter_kws={'color': 'black', 's': 20})
g.map_diag(sb.histplot, color = 'black', edgecolor = 'k', facecolor ='grey',
           kde = True, kde_kws = {'cut': 0.7}, line_kws = {"color": 'red'})
g.map_diag(sb.rugplot, color = 'black')
g.map_upper(corrdot)
g.map_upper(corrfunc)
g.fig.subplots_adjust(wspace = 0, hspace = 0)

# Remove axis labels
for ax in g.axes.flatten():
    ax.set_ylabel('')
    ax.set_xlabel('')

# Add titles to the diagonal axes/subplots
for ax, col in zip(np.diag(g.axes), data[cols].columns):
    ax.set_title(col, y=0.82, fontsize=26)
In [20]:
data["period_of_day2"].value_counts(normalize= True)
Out[20]:
Night owls      0.317195
Day-lighers     0.297810
Evening Rush    0.231332
Morning Rush    0.153663
Name: period_of_day2, dtype: float64

Encode VendorID

It currently takes the values 1 and 2; we shift it down to 0 and 1.

In [21]:
print(data.VendorID.value_counts())

data.VendorID = data.VendorID - 1
2    12626
1    10073
Name: VendorID, dtype: int64

Converting all remaining categories into dummies

In [75]:
data.columns
Out[75]:
Index(['Unnamed: 0', 'VendorID', 'tpep_pickup_datetime',
       'tpep_dropoff_datetime', 'passenger_count', 'trip_distance',
       'RatecodeID', 'store_and_fwd_flag', 'PULocationID', 'DOLocationID',
       'payment_type', 'fare_amount', 'extra', 'mta_tax', 'tip_amount',
       'tolls_amount', 'improvement_surcharge', 'total_amount',
       'duration_secs', 'duration_mins', 'week', 'day', 'month',
       'tpep_pickup_time', 'month_num', 'day_num', 'hour', 'period_of_day',
       'time_of_day', 'payment_cats', 'minutes', 'time', 'tip_percent',
       'good_tip', 'period_of_day2'],
      dtype='object')
In [23]:
data_ml = data.query("payment_cats == 'Credit card'").copy()

data_ml.columns
Out[23]:
Index(['Unnamed: 0', 'VendorID', 'tpep_pickup_datetime',
       'tpep_dropoff_datetime', 'passenger_count', 'trip_distance',
       'RatecodeID', 'store_and_fwd_flag', 'PULocationID', 'DOLocationID',
       'payment_type', 'fare_amount', 'extra', 'mta_tax', 'tip_amount',
       'tolls_amount', 'improvement_surcharge', 'total_amount',
       'duration_secs', 'duration_mins', 'week', 'day', 'month',
       'tpep_pickup_time', 'month_num', 'day_num', 'hour', 'period_of_day',
       'time_of_day', 'period_of_day2', 'payment_cats', 'minutes', 'time',
       'tip_percent', 'good_tip'],
      dtype='object')
In [24]:
for col in ['RatecodeID', 'PULocationID', 'DOLocationID']:
    data_ml[col] = data_ml[col].astype(str)
In [25]:
data_ml.drop(['Unnamed: 0', 'tpep_pickup_datetime','tpep_dropoff_datetime', 'store_and_fwd_flag','payment_type', 
           'fare_amount', 'extra', 'mta_tax', 'tip_amount','tolls_amount', 'improvement_surcharge', 'duration_mins', 
           'week','tpep_pickup_time', 'month_num', 'day_num', 'hour', 'period_of_day','time_of_day',"tip_percent",
           "minutes", "time", 'payment_cats'], 
          axis = 1, inplace = True)

data_ml.columns
Out[25]:
Index(['VendorID', 'passenger_count', 'trip_distance', 'RatecodeID',
       'PULocationID', 'DOLocationID', 'total_amount', 'duration_secs', 'day',
       'month', 'period_of_day2', 'good_tip'],
      dtype='object')
In [26]:
temp = pd.get_dummies(data_ml)
In [27]:
temp.columns
Out[27]:
Index(['VendorID', 'passenger_count', 'trip_distance', 'total_amount',
       'duration_secs', 'good_tip', 'RatecodeID_1', 'RatecodeID_2',
       'RatecodeID_3', 'RatecodeID_4',
       ...
       'month_Jul', 'month_Aug', 'month_Sep', 'month_Oct', 'month_Nov',
       'month_Dec', 'period_of_day2_Day-lighers',
       'period_of_day2_Evening Rush', 'period_of_day2_Morning Rush',
       'period_of_day2_Night owls'],
      dtype='object', length=352)
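The jump to 352 columns comes mostly from the high-cardinality `PULocationID`/`DOLocationID` strings: `pd.get_dummies` emits one indicator column per distinct value of every object-typed column. A toy illustration:

```python
import pandas as pd

toy = pd.DataFrame({"PULocationID": ["100", "186", "262", "100"],
                    "day": ["Mon", "Tue", "Mon", "Fri"]})

# One column per distinct category in each object column.
dummies = pd.get_dummies(toy)
print(dummies.shape[1])  # → 6 (3 pickup zones + 3 days)
print(list(dummies.columns))
```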

Train/test split

In [74]:
y = temp.good_tip

x = temp.drop("good_tip", axis = 1)

x_train, x_test, y_train, y_test = train_test_split(x, y, stratify = y, train_size = 0.8, random_state = 28)

x_train.sample(3)
Out[74]:
VendorID passenger_count trip_distance total_amount duration_secs RatecodeID_1 RatecodeID_2 RatecodeID_3 RatecodeID_4 RatecodeID_5 ... month_Jul month_Aug month_Sep month_Oct month_Nov month_Dec period_of_day2_Day-lighers period_of_day2_Evening Rush period_of_day2_Morning Rush period_of_day2_Night owls
6894 1 1 1.51 11.30 525.00 1 0 0 0 0 ... 0 0 0 1 0 0 0 1 0 0
15337 1 1 1.02 10.56 459.00 1 0 0 0 0 ... 0 0 0 0 0 0 0 1 0 0
9685 0 1 0.60 7.55 223.00 1 0 0 0 0 ... 0 0 0 1 0 0 0 1 0 0

3 rows × 351 columns

In [39]:
100*y.value_counts(normalize=True)
Out[39]:
1   72.83
0   27.17
Name: good_tip, dtype: float64
In [38]:
100*y_train.value_counts(normalize=True)
Out[38]:
1   72.83
0   27.17
Name: good_tip, dtype: float64
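`stratify = y` is why the train split's class balance above matches the full dataset to the decimal; without it, the proportions drift with the random seed. A small sketch, assuming a toy label vector with an exact 80/20 balance:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 80 positives, 20 negatives -> an exact 80/20 class balance.
y = np.array([1] * 80 + [0] * 20)
x = np.arange(100).reshape(-1, 1)

x_tr, x_te, y_tr, y_te = train_test_split(
    x, y, stratify=y, train_size=0.8, random_state=28)

# Each split keeps the original 80/20 ratio exactly.
print(y_tr.mean(), y_te.mean())  # → 0.8 0.8
```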

Modeling¶

Next, we fit 29 classification models to serve as baselines.

Baseline¶

In [29]:
from lazypredict.Supervised import LazyClassifier

classifier = LazyClassifier(predictions = True, verbose = 0, ignore_warnings = False, 
                            custom_metric = None, random_state = 28)
In [30]:
model_lazy, predictions = classifier.fit(x_train, x_test, y_train, y_test)

model_lazy_dictionary = classifier.provide_models(x_train, x_test, y_train, y_test)
 17%|█▋        | 5/29 [01:46<09:59, 24.97s/it]
CategoricalNB model failed to execute
Negative values in data passed to CategoricalNB (input X)
 90%|████████▉ | 26/29 [09:15<01:42, 34.07s/it]
StackingClassifier model failed to execute
StackingClassifier.__init__() missing 1 required positional argument: 'estimators'
100%|██████████| 29/29 [09:28<00:00, 19.62s/it]
In [35]:
model_lazy.sort_values(by = "F1 Score", ascending = False)
Out[35]:
Accuracy Balanced Accuracy ROC AUC F1 Score Time Taken
Model
XGBClassifier 0.93 0.90 0.90 0.93 11.24
LGBMClassifier 0.91 0.86 0.86 0.91 1.75
BaggingClassifier 0.89 0.85 0.85 0.89 5.48
DecisionTreeClassifier 0.86 0.82 0.82 0.86 1.23
RandomForestClassifier 0.75 0.56 0.56 0.68 11.08
AdaBoostClassifier 0.76 0.55 0.55 0.68 7.02
SGDClassifier 0.70 0.53 0.53 0.65 3.28
LogisticRegression 0.73 0.52 0.52 0.65 2.28
NuSVC 0.72 0.52 0.52 0.65 116.64
BernoulliNB 0.72 0.52 0.52 0.64 0.66
LinearSVC 0.72 0.52 0.52 0.64 23.58
LinearDiscriminantAnalysis 0.73 0.51 0.51 0.63 3.75
ExtraTreesClassifier 0.72 0.51 0.51 0.63 18.42
KNeighborsClassifier 0.68 0.51 0.51 0.63 2.55
RidgeClassifier 0.73 0.51 0.51 0.63 0.67
RidgeClassifierCV 0.73 0.51 0.51 0.63 2.70
SVC 0.73 0.51 0.51 0.62 97.66
Perceptron 0.62 0.53 0.53 0.62 0.54
ExtraTreeClassifier 0.62 0.51 0.51 0.62 0.66
CalibratedClassifierCV 0.73 0.50 0.50 0.62 92.76
DummyClassifier 0.73 0.50 0.50 0.61 0.55
PassiveAggressiveClassifier 0.61 0.52 0.52 0.61 0.62
NearestCentroid 0.59 0.53 0.53 0.60 0.92
LabelSpreading 0.48 0.54 0.54 0.51 101.71
LabelPropagation 0.48 0.54 0.54 0.51 56.63
QuadraticDiscriminantAnalysis 0.35 0.51 0.51 0.30 2.63
GaussianNB 0.28 0.50 0.50 0.15 0.74
In [46]:
predictions
Out[46]:
AdaBoostClassifier BaggingClassifier BernoulliNB CalibratedClassifierCV DecisionTreeClassifier DummyClassifier ExtraTreeClassifier ExtraTreesClassifier GaussianNB KNeighborsClassifier ... PassiveAggressiveClassifier Perceptron QuadraticDiscriminantAnalysis RandomForestClassifier RidgeClassifier RidgeClassifierCV SGDClassifier SVC XGBClassifier LGBMClassifier
0 1 0 1 1 1 1 1 1 0 1 ... 1 1 0 1 1 1 0 1 0 0
1 1 1 1 1 1 1 1 1 0 1 ... 0 1 0 1 1 1 1 1 1 1
2 1 1 1 1 0 1 1 1 0 1 ... 0 0 0 1 1 1 0 1 1 1
3 1 1 1 1 1 1 1 1 0 1 ... 1 1 0 1 1 1 1 1 1 1
4 1 1 1 1 1 1 1 1 0 1 ... 0 1 0 1 1 1 1 1 1 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3048 1 1 1 1 1 1 0 1 0 1 ... 0 1 1 1 1 1 1 1 1 1
3049 1 1 1 1 1 1 1 1 0 1 ... 1 0 1 1 1 1 1 1 1 1
3050 1 0 1 1 0 1 1 1 0 0 ... 0 1 0 1 1 1 1 1 0 0
3051 1 0 1 1 0 1 1 1 0 1 ... 0 1 0 1 1 1 1 1 0 0
3052 1 1 0 1 1 1 1 1 0 1 ... 1 1 0 1 1 1 1 1 1 1

3053 rows × 27 columns

GridSearch - Hyperparameter tuning¶

XGBoost had the highest baseline metrics (albeit suspiciously high), so let's go ahead and tune that model to see if we can improve its performance. I'm a tad worried that there may have been data leakage, especially with the fare variables... but what can we do =D
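One quick leakage sanity check is worth a cell before tuning: if a feature was even partly derived from the target, its correlation with the target will be suspiciously high. A minimal sketch on a stand-in frame (the `good_tip` column name and the toy data are hypothetical; in the notebook this would run on the engineered training frame):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(28)

# Stand-in frame; in the notebook this would be the engineered `data`.
df = pd.DataFrame({
    "total_amount": rng.normal(16, 5, 1000),
    "trip_distance": rng.normal(3, 1, 1000),
})
# Hypothetical binary target: 1 if the tip was >= 20% of the fare.
df["good_tip"] = (rng.random(1000) < 0.3).astype(int)

# A feature leaked from the target shows an outsized point-biserial correlation.
leak_check = (df.drop(columns="good_tip")
                .corrwith(df["good_tip"])
                .abs()
                .sort_values(ascending=False))
print(leak_check)
```

On the real frame, a fare-derived feature correlating far above everything else would confirm the leakage worry.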

In [56]:
model_lazy_dictionary["XGBClassifier"]
Out[56]:
Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('numeric',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer()),
                                                                  ('scaler',
                                                                   StandardScaler())]),
                                                  Index(['VendorID', 'passenger_count', 'trip_distance', 'total_amount',
       'duration_secs', 'RatecodeID_1', 'RatecodeID_2', 'RatecodeID_3',
       'RatecodeID_4', 'RatecodeID_5',
       ...
       'month_Jul', 'month_Aug', 'month_Sep', 'mon...
                               feature_types=None, gamma=None, gpu_id=None,
                               grow_policy=None, importance_type=None,
                               interaction_constraints=None, learning_rate=None,
                               max_bin=None, max_cat_threshold=None,
                               max_cat_to_onehot=None, max_delta_step=None,
                               max_depth=None, max_leaves=None,
                               min_child_weight=None, missing=nan,
                               monotone_constraints=None, n_estimators=100,
                               n_jobs=None, num_parallel_tree=None,
                               predictor=None, random_state=28, ...))])
In [57]:
from xgboost import XGBClassifier
#from lightgbm import LGBMClassifier
#from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
#from sklearn.tree import DecisionTreeClassifier
import pickle

from sklearn.model_selection import GridSearchCV

model_xgb = XGBClassifier(objective = "binary:logistic", random_state = 28)
# model_lgbm = LGBMClassifier(random_state = 28)
# model_bc = BaggingClassifier(random_state = 28)
# model_rf = RandomForestClassifier(random_state = 28)
# model_dt = DecisionTreeClassifier(random_state = 28)
In [55]:
def make_results(model_name:str, model_object, metric:str):
    '''
    Arguments:
        model_name (string): what you want the model to be called in the output table
        model_object: a fit GridSearchCV object
        metric (string): precision, recall, f1, or accuracy
  
    Returns a pandas df with the F1, recall, precision, and accuracy scores
    for the model with the best mean 'metric' score across all validation folds.  
    '''

    # Create dictionary that maps input metric to actual metric name in GridSearchCV
    metric_dict = {'precision': 'mean_test_precision',
                 'recall': 'mean_test_recall',
                 'f1': 'mean_test_f1',
                 'accuracy': 'mean_test_accuracy',
                 }

    # Get all the results from the CV and put them in a df
    cv_results = pd.DataFrame(model_object.cv_results_)

    # Isolate the row of the df with the max(metric) score
    best_estimator_results = cv_results.iloc[cv_results[metric_dict[metric]].idxmax(), :]

    # Extract Accuracy, precision, recall, and f1 score from that row
    f1 = best_estimator_results.mean_test_f1
    recall = best_estimator_results.mean_test_recall
    precision = best_estimator_results.mean_test_precision
    accuracy = best_estimator_results.mean_test_accuracy
  
    # Create table of results
    # (DataFrame.append was removed in pandas 2.0, so build the frame directly)
    table = pd.DataFrame([{'Model': model_name,
                           'Precision': precision,
                           'Recall': recall,
                           'F1': f1,
                           'Accuracy': accuracy,
                           }])
  
    return table

Grid search with 4-fold cross validation

  • I'm prioritising the F1 score as the most important metric,
  • because it is the harmonic mean of recall and precision,
  • and we want to reduce both type 1 and type 2 errors so that neither drivers nor passengers are disadvantaged
In [58]:
# parameters to be tried
cv_params = {'max_depth': [4,8,12], 
             'min_child_weight': [3, 5],
             'learning_rate': [0.01, 0.1],
             'n_estimators': [300, 500]
             } 

# scoring metrics to capture
scoring = {'accuracy', 'precision', 'recall', 'f1'}

# Instantiate the GridSearchCV object
grid_xgb = GridSearchCV(model_xgb, cv_params, scoring = scoring, cv = 4, refit = 'f1')
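Note the cost: this grid spans 3 × 2 × 2 × 2 = 24 parameter combinations, and with cv = 4 each candidate is fit 4 times, i.e. 96 fits, which explains the long runtime below. A quick way to count candidates before launching the search, using sklearn's ParameterGrid:

```python
from sklearn.model_selection import ParameterGrid

cv_params = {'max_depth': [4, 8, 12],
             'min_child_weight': [3, 5],
             'learning_rate': [0.01, 0.1],
             'n_estimators': [300, 500]}

n_candidates = len(ParameterGrid(cv_params))  # 3 * 2 * 2 * 2 = 24
n_folds = 4
print(f"{n_candidates} candidates x {n_folds} folds = {n_candidates * n_folds} fits")
```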

Pickle our model

In [59]:
grid_xgb.fit(x_train, y_train)

with open('taxi_xgb_model.pickle', 'wb') as to_write:
    pickle.dump(grid_xgb, to_write)
In [ ]:
"""with open('taxi_xgb_model.pickle', 'rb') as to_read:
        grid_xgb = pickle.load(to_read)"""
In [60]:
grid_xgb.best_estimator_
Out[60]:
XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric=None, feature_types=None,
              gamma=None, gpu_id=None, grow_policy=None, importance_type=None,
              interaction_constraints=None, learning_rate=0.1, max_bin=None,
              max_cat_threshold=None, max_cat_to_onehot=None,
              max_delta_step=None, max_depth=12, max_leaves=None,
              min_child_weight=3, missing=nan, monotone_constraints=None,
              n_estimators=500, n_jobs=None, num_parallel_tree=None,
              predictor=None, random_state=28, ...)
In [64]:
# slightly improved cross-validated F1 score, up from 93% with the baseline model

100*grid_xgb.best_score_
Out[64]:
95.29103138765164
In [62]:
grid_xgb.best_params_
Out[62]:
{'learning_rate': 0.1,
 'max_depth': 12,
 'min_child_weight': 3,
 'n_estimators': 500}

Evaluation¶

Classification Report¶

In [63]:
from yellowbrick.classifier import class_prediction_error, classification_report
In [67]:
classification_report(grid_xgb.best_estimator_ , x_train, y_train, x_test, y_test, support = 'percent', cmap = "Reds", fontsize = '16',
                      classes = ["normal tip", "good tip"])
Out[67]:
ClassificationReport(ax=<AxesSubplot: title={'center': 'XGBClassifier Classification Report'}>,
                     classes=['normal tip', 'good tip'],
                     cmap=<matplotlib.colors.ListedColormap object at 0x0000023A84853210>,
                     estimator=XGBClassifier(base_score=None, booster=None,
                                             callbacks=None,
                                             colsample_bylevel=None,
                                             colsample_bynode=None,
                                             colsample_bytree=None,
                                             early_stopping_rounds=None...
                                             importance_type=None,
                                             interaction_constraints=None,
                                             learning_rate=0.1, max_bin=None,
                                             max_cat_threshold=None,
                                             max_cat_to_onehot=None,
                                             max_delta_step=None, max_depth=12,
                                             max_leaves=None,
                                             min_child_weight=3, missing=nan,
                                             monotone_constraints=None,
                                             n_estimators=500, n_jobs=None,
                                             num_parallel_tree=None,
                                             predictor=None, random_state=28, ...),
                     fontsize='16', support='percent')

Class Prediction Error¶

In [69]:
class_prediction_error(grid_xgb.best_estimator_, x_train, y_train, x_test, y_test, support = 'percent', cmap = "Reds", fontsize = '16',
                      classes = ["normal tip", "good tip"])
Out[69]:
ClassPredictionError(ax=<AxesSubplot: title={'center': 'Class Prediction Error for XGBClassifier'}, xlabel='actual class', ylabel='number of predicted class'>,
                     classes=['normal tip', 'good tip'],
                     estimator=XGBClassifier(base_score=None, booster=None,
                                             callbacks=None,
                                             colsample_bylevel=None,
                                             colsample_bynode=None,
                                             colsample_bytree=None,
                                             early_stopping_rounds=None,
                                             enable_ca...
                                             gpu_id=None, grow_policy=None,
                                             importance_type=None,
                                             interaction_constraints=None,
                                             learning_rate=0.1, max_bin=None,
                                             max_cat_threshold=None,
                                             max_cat_to_onehot=None,
                                             max_delta_step=None, max_depth=12,
                                             max_leaves=None,
                                             min_child_weight=3, missing=nan,
                                             monotone_constraints=None,
                                             n_estimators=500, n_jobs=None,
                                             num_parallel_tree=None,
                                             predictor=None, random_state=28, ...))
In [79]:
100*accuracy_score(y_test, grid_xgb.best_estimator_.predict(x_test))
Out[79]:
93.21978381919423

As seen in the classification report and class prediction error visuals (both akin to a confusion matrix), the classes are unbalanced, leading to low support and weaker metrics for the passengers who tipped less than 20% of their fare amount. Regardless, the model gives the correct prediction in more than 9 out of 10 rides (93% of cases, to be specific).
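The same imbalance can be quantified with plain sklearn instead of Yellowbrick. A sketch on stand-in labels (the ~73% majority split mirrors what the DummyClassifier's accuracy implied; in the notebook, `y_test` and `grid_xgb.best_estimator_.predict(x_test)` would replace the toy arrays):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, classification_report

# Stand-in labels: ~73% majority class, as implied by the DummyClassifier's accuracy.
y_true = np.array([1] * 73 + [0] * 27)
y_pred = np.array([1] * 70 + [0] * 3 + [1] * 12 + [0] * 15)

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
print(cm)
print(classification_report(y_true, y_pred, target_names=["normal tip", "good tip"]))
```

The per-class recall gap in the report makes the minority-class weakness explicit rather than visual.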

Most Important Variables¶

In [71]:
from xgboost import plot_importance

plot_importance(grid_xgb.best_estimator_, max_num_features=10)
Out[71]:
<AxesSubplot: title={'center': 'Feature importance'}, xlabel='F score', ylabel='Features'>
  • That will be our model for deployment.
  • The total_amount variable I was worried about leads the importance plot, with an F score (XGBoost's split count, not F1) more than 50% above the next variable's, but yay! Great model!
In [ ]:
import os, IPython
In [93]:
%%javascript
// save first; the %%javascript magic applies to the entire cell,
// so the nbconvert call must live in its own Python cell
IPython.notebook.save_notebook()
In [ ]:
os.system("jupyter nbconvert --execute --to html 'Taxi Duration.ipynb'")